Abstract

The problem of estimating the directed information rate between two discrete processes
$\{X_n\}$ and $\{Y_n\}$ via the plug-in (or maximum-likelihood) estimator is considered. When the
joint process $\{(X_n,Y_n)\}$ is a Markov chain of a given memory length, the plug-in estimator
is shown to be asymptotically Gaussian and to converge at the optimal rate $O(1/\sqrt{n})$ under
appropriate conditions; this is the first estimator that has been shown to achieve this rate. An
important connection is drawn between the problem of estimating the directed information
rate and that of performing a hypothesis test for the presence of causal influence between the
two processes. Under fairly general conditions, the null hypothesis, which corresponds to the
absence of causal influence, is equivalent to the requirement that the directed information
rate be equal to zero. In that case a finer result is established, showing that the plug-in
converges at the faster rate $O(1/n)$ and that it is asymptotically $\chi^2$-distributed. This is
proved by showing that this estimator is equal to (a scalar multiple of) the classical likelihood
ratio statistic for the above hypothesis test. Finally it is noted that these results facilitate
the design of an actual likelihood ratio test for the presence or absence of causal influence.
1 Introduction


Hypothesis testing and mutual information. One of the most prevalent statistical tools,
used universally across the sciences today, is the test for independence. Suppose we have
independent and identically distributed data pairs $(X_1,Y_1),(X_2,Y_2),\ldots,(X_n,Y_n)$ and we wish
to test whether the X and Y variables are independent or not. Assuming both sets of variables
take on finitely many values, we can compare the joint empirical distribution $\hat{P}_{XY,n}(a,b)$ with the
product of the empirical marginals $\hat{P}_{X,n}(a)\hat{P}_{Y,n}(b)$; as usual, $\hat{P}_{XY,n}(a,b)$ denotes the proportion
of times the pair (a, b) appears in the whole sample, and similarly for the marginals. Pearson's
$\chi^2$ test dictates that we compute the (normalized) $\chi^2$ distance between these two distributions,

$$\chi^2_n = n \sum_{a,b} \frac{\big[\hat{P}_{XY,n}(a,b) - \hat{P}_{X,n}(a)\hat{P}_{Y,n}(b)\big]^2}{\hat{P}_{X,n}(a)\hat{P}_{Y,n}(b)}. \tag{1}$$

If the data are indeed independent then the distribution of the statistic $\chi^2_n$, for large sample sizes
n, is approximately $\chi^2$ with $(m-1)(\ell-1)$ degrees of freedom, where $m,\ell$ are the sizes of the
alphabets of X and Y, respectively. Therefore, we can compute the probability of observing a
value greater than or equal to $\chi^2_n$ under this distribution, and if this probability is appropriately
small then we can reject the independence hypothesis.
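As a minimal sketch of the statistic in (1), the following Python function computes it from paired samples. The function name `pearson_chi2` is hypothetical, and the alphabet sizes $m,\ell$ are taken to be the numbers of distinct symbols actually observed, which is an assumption of this illustration.

```python
from collections import Counter

def pearson_chi2(xs, ys):
    """Return (statistic, degrees of freedom) for paired samples xs, ys.

    The statistic is n * sum_{a,b} (P_XY(a,b) - P_X(a) P_Y(b))^2 / (P_X(a) P_Y(b)),
    with all distributions replaced by empirical frequencies.
    """
    n = len(xs)
    joint = Counter(zip(xs, ys))   # counts of each pair (a, b)
    px = Counter(xs)               # counts of each symbol a
    py = Counter(ys)               # counts of each symbol b
    total = 0.0
    for a in px:
        for b in py:
            product = px[a] * py[b] / n ** 2   # product of empirical marginals
            observed = joint[(a, b)] / n       # joint empirical probability
            total += (observed - product) ** 2 / product
    # Assumption: alphabet sizes are the numbers of distinct observed symbols.
    dof = (len(px) - 1) * (len(py) - 1)
    return n * total, dof
```

For a perfectly dependent binary sample such as `xs = ys = [0, 0, 1, 1]` the statistic equals the sample size, while a sample whose joint frequencies factorize exactly gives zero.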

Another classical test, more closely related to information-theoretic ideas, is the likelihood
ratio test, based on the statistic,

$$\Delta_n = 2\log\left[\frac{\prod_{i=1}^n \hat{P}_{XY,n}(X_i,Y_i)}{\prod_{i=1}^n \hat{P}_{X,n}(X_i)\hat{P}_{Y,n}(Y_i)}\right].$$

This has the exact same asymptotic distribution as $\chi^2_n$, and an analogous test can be performed.
In fact, the $\chi^2_n$ statistic can be viewed (and sometimes its use is thus justified) as a quadratic
approximation to the nonlinear statistic $\Delta_n$. A more important observation for our purposes is
that, after simple algebra, the likelihood ratio test statistic can be expressed exactly as a mutual
information,

$$\Delta_n = 2nI(\hat{X};\hat{Y}) = 2nD\big(\hat{P}_{XY,n}\,\big\|\,\hat{P}_{X,n}\hat{P}_{Y,n}\big), \tag{2}$$

where the random variables $\hat{X},\hat{Y}$ have distribution $\hat{P}_{XY,n}$. Therefore, instead of the $\chi^2$ distance
used in (1), the likelihood ratio test statistic (2) examines the (normalized) relative entropy
distance between $\hat{P}_{XY,n}$ and $\hat{P}_{X,n}\hat{P}_{Y,n}$. And yet another way to interpret $\Delta_n$ is as the "plug-in"
estimate of the mutual information $I(X_1;Y_1)$ of the data, using their empirical distribution.
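The identity between the likelihood ratio statistic and $2n$ times the plug-in mutual information can be checked numerically. The sketch below computes both quantities separately from the same i.i.d. sample; the function names are hypothetical and logarithms are natural, as elsewhere in the text.

```python
import math
from collections import Counter

def plugin_mutual_information(xs, ys):
    """Mutual information (in nats) of the empirical joint distribution."""
    n = len(xs)
    joint, px, py = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    # (c/n) * log[(c/n) / ((px/n)(py/n))] simplifies to (c/n) * log[c*n / (px*py)]
    return sum((c / n) * math.log(c * n / (px[a] * py[b]))
               for (a, b), c in joint.items())

def likelihood_ratio_statistic(xs, ys):
    """2 * log of the ratio of empirical joint to product likelihoods."""
    n = len(xs)
    joint, px, py = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    log_num = sum(math.log(joint[(x, y)] / n) for x, y in zip(xs, ys))
    log_den = sum(math.log(px[x] / n) + math.log(py[y] / n)
                  for x, y in zip(xs, ys))
    return 2.0 * (log_num - log_den)
```

On any sample the two computations agree up to floating-point error, that is, `likelihood_ratio_statistic(xs, ys)` matches `2 * len(xs) * plugin_mutual_information(xs, ys)`.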

The asymptotic distribution of $\Delta_n$ has been re-derived several times historically. In its
general form it goes back to the classical result of Wilks [?]; see also the texts [23, 35]. In more
recent years it has also reappeared in an information-theoretic context; see, e.g., [?].

Estimating directed information and causality testing. This work examines the problem
of estimating a different information-theoretic functional: If $X = \{X_n\}$ and $Y = \{Y_n\}$ are
two finite-valued random processes, then the directed information $I(X_1^n \to Y_1^n)$ between
$X_1^n = (X_1, X_2,\ldots,X_n)$ and $Y_1^n = (Y_1, Y_2,\ldots,Y_n)$ is defined as,

$$I(X_1^n \to Y_1^n) = H(Y_1^n) - \sum_{i=1}^n H(Y_i \mid Y_1^{i-1}, X_1^i),$$

and the directed information rate between X and Y is,

$$I(X \to Y) = \lim_{n\to\infty} \frac{1}{n} I(X_1^n \to Y_1^n), \tag{3}$$

whenever the limit exists; precise definitions are given in Sections 2 and 3. Directed information
was introduced by Massey [25], building on earlier work by Marko [23], in order to provide
capacity characterizations for channels with causal feedback. Subsequent work in this direction
includes [?], and the dual problem of lossy data compression with feedforward
is treated, e.g., in [?]. Additional areas where directed information plays an important role
include, among others, distributed and causal data compression and hypothesis testing
[?, 29], network communications and control [22, ?], dynamically switching networks [?], sensor
networks [33, ?], and causal estimation [?]; see also the references in the above works.

Here we consider the problem of estimating the directed information rate, by tracing the path
described above in connection with the mutual information in the reverse direction. Assuming
that the joint process $\{(X_n,Y_n)\}$ is a Markov chain of memory length $k \geq 1$, we begin by
recalling that under fairly general conditions the limit (3) can be expressed in more manageable
form. For example, in Proposition 3.1 we note that when $\{Y_n\}$ itself is also a Markov chain
of order no greater than k, the directed information rate is equal to the conditional mutual
information $I(\bar{Y}_0; \bar{X}_{-k}^0 \mid \bar{Y}_{-k}^{-1})$, where $\{(\bar{X}_n,\bar{Y}_n)\}$ denotes the stationary version of the chain.

Main results. In Section 3.2 we define what is probably the simplest estimator of $I(X \to Y)$:
Interpreting the conditional mutual information $I(\bar{Y}_0; \bar{X}_{-k}^0 \mid \bar{Y}_{-k}^{-1})$ as a functional of the
$(k+1)$-dimensional distribution of $(\bar{X}_{-k}^0, \bar{Y}_{-k}^0)$, we define the plug-in estimator $\hat{I}_n^{(k)}(X \to Y)$ as the
same functional of the corresponding empirical distribution. If the original chain is ergodic,
then it is easy to see that this estimator is consistent with probability one. Our main results,
stated in Theorems 3.2 and 3.3, give finer information for this convergence; cf. (4) and (5) below.

On the one hand, if $I(X \to Y) > 0$ we show that, under appropriate conditions, the plug-in
estimator is approximately Gaussian for large n,

$$\hat{I}_n^{(k)}(X \to Y) \approx N\Big(I(X \to Y), \frac{\sigma^2}{n}\Big), \tag{4}$$

where the variance $\sigma^2$ is identified in Theorem 3.2. Therefore, the plug-in estimator converges
at a rate $O(1/\sqrt{n})$ in probability. In fact, Corollary 3.4 shows that the same rate holds in
$L^1$, and in view of [18, Proposition 3] this implies that the plug-in estimator is optimal in
that it converges at the fastest possible rate. Moreover, in Corollary 3.4 we also establish the
almost-sure convergence rate of the plug-in estimator, under fairly general conditions.

On the other hand we note that $I(X \to Y) = 0$ if and only if a certain conditional
independence property holds, which can be interpreted as the absence of causal influence from
X to Y: Roughly speaking, $I(X \to Y)$ is zero if and only if each $Y_i$, given the past values of
the Y process, is conditionally independent of the values of the X process up to time i. In
fact, one of the main contributions of this work is the identification of two related problems:
(1.) understanding the asymptotics of the plug-in estimator $\hat{I}_n^{(k)}(X \to Y)$; and (2.) analyzing
the likelihood ratio test for the above causality hypothesis.

Intuitively, determining whether the estimated directed information $\hat{I}_n^{(k)}(X \to Y)$ is
significantly close to zero or not is related to testing the hypothesis that the above conditional
independence relationship holds. Formally, as we show in Proposition 3.5, the (normalized)
likelihood ratio statistic for this test is exactly equal to the plug-in estimator $\hat{I}_n^{(k)}(X \to Y)$.
This connection is described in detail in Section 3.3. Apart from being intellectually satisfying,
it also allows us to derive the exact asymptotic distribution of $\hat{I}_n^{(k)}(X \to Y)$.

Indeed, we show that, under appropriate conditions, if the directed information rate is zero,
then a finer result than (4) can be established for the plug-in estimator; for large n we have,

$$\hat{I}_n^{(k)}(X \to Y) \approx \frac{1}{2n}\,\chi^2\big(\ell^k(m^{k+1}-1)(\ell-1)\big), \tag{5}$$

where $\chi^2\big(\ell^k(m^{k+1}-1)(\ell-1)\big)$ is an appropriate $\chi^2$ distribution that only depends on the sizes $m,\ell$ of the
alphabets of X and Y and on the memory size k. In other words, under the null hypothesis
(which asserts the presence of the conditional independence being tested), the plug-in estimator
converges at a faster rate $O(1/n)$, and the distribution to which it converges only depends on
the three known parameters $m,\ell$ and k. Therefore, a likelihood ratio test for this type of causal
influence can again be performed as before.

In Section 2 we consider the simpler problem of estimating the mutual information rate
$\lim_n I(X_n; X_0^{n-1})$ of a Markov chain $\{X_n\}$. The results presented there can be viewed as
preliminary versions of the analogous results for the directed information given in Section 3. For
clarity of exposition, all proofs are collected in the Appendix.

Earlier work. The connection between the problem of identifying causal relationships and 
that of testing for conditional independence has a long history. Perhaps the most prominent 
example is the Granger causality test [?], which uses an autoregressive model (later extended
in several directions, most notably to generalized linear models), within which the conditional 
independence hypothesis described above is tested. The connection of this test with directed 
information has previously been explored in several directions; see [2] for a comprehensive review. 

Several different approaches to the problem of directed information estimation have appeared
in the literature in recent years. Rao et al. [34] use Miller's [27] differential entropy estimators
to estimate (the continuous analog of) directed information, with the aim of identifying causal
influence in networks of genes. In the context of neuroscience, Quinn et al. [?] use parametric
estimation based on generalized linear models to estimate directed information, in order to
detect the presence of direct or indirect influence in neuronal networks. And in the subsequent
work [32], a near-optimal rate of convergence is established for the plug-in estimator.

In terms of the present development, the most interesting work is the recent paper by Jiao
et al. [18]. There, several new estimators for the directed information rate are introduced
and they are shown to be consistent under very general conditions. For some of these
estimators, particularly those based on the context tree weighting algorithm [33], detailed convergence
bounds are also obtained. It is worth noting that our convergence results are obtained under
conditions essentially identical to (though slightly weaker than) those required for the bounds
in [18]. But using the plug-in also facilitates the connection with hypothesis testing developed
here, and makes it possible to obtain, instead of convergence bounds, accurate and sometimes
exact asymptotics as described above.

Finally, in a broader context, we note the convergence results presented for mutual
information in the very recent work [?]. Although the problems treated there are mostly in the
case of independent observations, they provide a general minimax framework for examining the
asymptotic optimality of different estimators.
Different approaches. The problem of estimating directed information via the plug-in
has arisen as a natural question within several different areas. Motivated by applications in
econometrics, one of its earliest appearances is in [?], where the directed information functional
is defined as a Kullback causality measure and is derived as the limiting form of a likelihood
ratio statistic used in a temporal causality hypothesis test that is closely related to the test
we describe in Section 3.3. For the special case of first-order Markov chains (and under some
additional assumptions), the order-(1/n) convergence rate of this statistic to the $\chi^2$ distribution
is discussed in [?].

In the physics literature directed information has also appeared extensively under the name 
transfer entropy, and the performance of the corresponding plug-in estimator is examined in 
the recent works [?]. There, several asymptotic results as well as non-asymptotic bounds are
described, though the main emphasis appears to be less on providing rigorous proofs and more 
on exploring the possible qualitative asymptotic properties of the plug-in. 

Arguably the most effective, unifying approach to the task of understanding the behavior 
of the plug-in estimator for directed information comes from taking the point of view of the 
asymptotic analysis of maximum likelihood estimates in theoretical statistics. This is indeed the 
point of view adopted in this work, and for the proofs of our main results we rely on two relevant 
sets of tools: One is the classical central limit theorem and the law of the iterated logarithm 
for Markov chains [7], and the other is the general asymptotic theory of statistical inference for 
Markovian observations [5]. 

In this vein it should be pointed out (as one of the anonymous reviewers generously outlined
at length in their report) that the modern development of classical asymptotics in statistical
decision theory provides a powerful technical approach to the task at hand, which can lead to what
are apparently the strongest possible results, accompanied by the sharpest intuition. That
development, pioneered by Le Cam, Hajek and others since the 1980s, cf. [?], is built around
the "local asymptotic normality," or LAN, condition. This requires that the log-likelihood ratio
between models with nearby parameters, when evaluated at observations produced by a fixed,
true distribution corresponding to one of these two parameters, can be appropriately
approximated asymptotically by a multivariate Gaussian. For the class of Markov models considered in
the present context, this is not hard to verify, as was done, e.g., in [28]. Then, given the LAN
condition, the asymptotic behavior of the plug-in, as well as its strong optimality properties, can
be established by applications of what are, by now, standard results in this area; see the local
asymptotic minimax theorem and the convolution theorem in [39, Sections 8.7, 8.9].
2 Mutual information


2.1 Preliminaries 

Suppose X is a discrete random variable with values in a finite set A, and with a distribution
described by its probability mass function, $P_X(x) = \Pr\{X = x\}$, for $x \in A$. The entropy of
X is, $H(X) = H(P_X) = -\sum_{x\in A} P_X(x)\log P_X(x)$, where, throughout the paper, 'log' denotes the
natural logarithm. Viewed as a single random element, the joint entropy of any finite collection
of random variables $X_1^n = (X_1, X_2,\ldots,X_n)$ is defined analogously; the mutual information
between two random variables X and Y is $I(X;Y) = H(X) + H(Y) - H(X,Y)$; the conditional
entropy is $H(X|Y) = H(X,Y) - H(Y)$; and the conditional mutual information is $I(X;Y|Z) =
H(X|Z) + H(Y|Z) - H(X,Y|Z)$. As above, we generally write $X_i^j = (X_i,X_{i+1},\ldots,X_j)$, $i \leq j$,
for vectors of random variables and similarly $a_i^j = (a_i,a_{i+1},\ldots,a_j) \in A^{j-i+1}$, $i \leq j$, for strings
of individual symbols from a finite set A.

The joint distribution of an arbitrary number of discrete random variables is described by
their joint probability mass function. For example, the joint distribution of (X, Y, Z) is denoted,
$P_{XYZ}(x,y,z) = \Pr\{X = x, Y = y, Z = z\}$. We write the induced marginal distributions in the
obvious way, e.g., $P_{XY}(x,y) = \Pr\{X = x, Y = y\}$ and $P_Z(z) = \Pr\{Z = z\}$, and the induced
conditionals are similarly denoted, e.g., $P_{XY|Z}(x,y|z) = \Pr\{X = x, Y = y \mid Z = z\}$.
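The definitions above translate directly into code. The following sketch represents a joint probability mass function as a dictionary mapping outcome tuples to probabilities (a representation chosen here for illustration, with hypothetical function names) and evaluates entropy, mutual information and conditional mutual information in nats.

```python
import math
from collections import defaultdict

def entropy(pmf):
    """H(P) = -sum_x P(x) log P(x), natural logarithm; zero-mass outcomes are skipped."""
    return -sum(p * math.log(p) for p in pmf.values() if p > 0)

def marginal(pmf, idx):
    """Marginal pmf of the coordinates listed in idx of a joint pmf over tuples."""
    out = defaultdict(float)
    for outcome, p in pmf.items():
        out[tuple(outcome[i] for i in idx)] += p
    return dict(out)

def mutual_information(pxy):
    """I(X;Y) = H(X) + H(Y) - H(X,Y) for a pmf over pairs (x, y)."""
    return entropy(marginal(pxy, (0,))) + entropy(marginal(pxy, (1,))) - entropy(pxy)

def conditional_mutual_information(pxyz):
    """I(X;Y|Z) for a pmf over triples (x, y, z).

    Uses H(X|Z) + H(Y|Z) - H(X,Y|Z) = H(X,Z) + H(Y,Z) - H(X,Y,Z) - H(Z).
    """
    return (entropy(marginal(pxyz, (0, 2))) + entropy(marginal(pxyz, (1, 2)))
            - entropy(pxyz) - entropy(marginal(pxyz, (2,))))
```

For example, if X and Y are independent fair bits and Z is their XOR, then `mutual_information` of the pair (X, Y) is zero while `conditional_mutual_information` of the triple equals log 2.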

2.2 The plug-in estimator of mutual information 

Suppose $X = \{X_n\} = \{X_n \,;\, n \geq 0\}$ is a homogeneous, first-order Markov chain on a
finite alphabet A, with an arbitrary initial distribution for $X_0$ and with transition matrix
$Q = (Q(a'|a) \,;\, a, a' \in A)$. Although it is not necessary for much of what follows, in order
to avoid uninteresting technicalities we assume that $Q(a'|a) > 0$ for all $a, a' \in A$; see the
remarks following the statement of Theorem 2.1 regarding how this assumption can be relaxed.
Then $\{X_n\}$ has a unique stationary distribution $\pi$ supported on all of A.

Let $\{\bar{X}_n\}$ denote the stationary version of $\{X_n\}$, having the same transition matrix Q but
with initial distribution $\bar{X}_0 \sim \pi$. We wish to estimate the mutual information rate of the Markov
chain $\{X_n\}$,

$$\lim_{n\to\infty} I(X_n; X_0^{n-1}) = \lim_{n\to\infty} I(X_n; X_{n-1}),$$

which is easily seen to be equal to the mutual information of the chain in equilibrium, namely,

$$I_\pi(X_0;X_1) = I(\bar{X}_0;\bar{X}_1) = I(P_{\bar{X}_0\bar{X}_1}) = H(\bar{X}_0) + H(\bar{X}_1) - H(\bar{X}_0,\bar{X}_1).$$

Let $\hat{P}_{X_0X_1,n}$ denote the bivariate empirical distribution obtained from the sample $X_0^n$,

$$\hat{P}_{X_0X_1,n}(a,a') = \frac{1}{n}\sum_{i=1}^n \mathbb{I}_{\{X_{i-1}=a,\,X_i=a'\}}, \qquad a,a' \in A,$$

where $\mathbb{I}_E$ denotes the indicator function of an event E, which equals 1 when E occurs and 0
otherwise. Then the plug-in estimator $\hat{I}_n(P_{X_0X_1})$ for $I_\pi(X_0;X_1) = I(P_{\bar{X}_0\bar{X}_1})$ is defined as,

$$\hat{I}_n(P_{X_0X_1}) = I(\hat{P}_{X_0X_1,n}) = H(\hat{P}_{X_0,n}) + H(\hat{P}_{X_1,n}) - H(\hat{P}_{X_0X_1,n}),$$

where we use the same conventions for the notation of marginals, e.g., $\hat{P}_{X_0,n}(a)$, and conditionals,
e.g., $\hat{P}_{X_1|X_0,n}(a'|a)$, as those described in Section 2.1.
Recall that, since all the transition probabilities of $\{X_n\}$ are nonzero, the bivariate chain
$\{Z_n = (X_{n-1}, X_n) \,;\, n \geq 1\}$ on $A \times A$ is ergodic, so the ergodic theorem [7] implies that,
as $n \to \infty$, $\hat{P}_{X_0,n} \to \pi$, $\hat{P}_{X_1,n} \to \pi$, and $\hat{P}_{X_0X_1,n} \to P_{\bar{X}_0\bar{X}_1}$, almost surely (a.s.). Therefore,
$\hat{I}_n(P_{X_0X_1})$ converges to the desired value, $I(\bar{X}_0;\bar{X}_1)$, a.s., as $n \to \infty$. The following result
describes its finer asymptotic behavior; its proof is given in Appendix A.1.
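A minimal sketch of the plug-in estimator just defined, for a sample $X_0, X_1, \ldots, X_n$ given as a Python list. The function name `plugin_mi_markov` is hypothetical; the computation is simply the mutual information of the empirical distribution of consecutive pairs.

```python
import math
from collections import Counter

def plugin_mi_markov(sample):
    """Plug-in estimate (in nats) from the sample X_0, ..., X_n.

    Forms the bivariate empirical distribution of the n consecutive pairs
    (X_{i-1}, X_i) and returns its mutual information.
    """
    n = len(sample) - 1
    pairs = Counter(zip(sample[:-1], sample[1:]))  # counts of (X_{i-1}, X_i)
    first = Counter(sample[:-1])                   # counts for the X_0-marginal
    second = Counter(sample[1:])                   # counts for the X_1-marginal
    return sum((c / n) * math.log(c * n / (first[a] * second[b]))
               for (a, b), c in pairs.items())
```

A strictly alternating binary sample yields log 2, the largest possible value on a binary alphabet, whereas a sample in which all four transitions occur equally often yields zero.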


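Theorem 2.1 Let $X = \{X_n\}$ be a Markov chain with an all positive transition matrix $Q =
(Q(a'|a))$ on the finite alphabet A, and with an arbitrary initial distribution.

(i) If the random variables $\{X_n\}$ are not independent, equivalently, if $I_\pi(X_0;X_1) > 0$, then,

$$\sqrt{n}\,\big[\hat{I}_n(P_{X_0X_1}) - I_\pi(X_0;X_1)\big] \xrightarrow{D} N(0,\sigma^2), \qquad \text{as } n \to \infty,$$

where $\xrightarrow{D}$ denotes convergence in distribution, $N(0,\sigma^2)$ is the zero-mean normal
distribution with variance $\sigma^2$, and with $\sigma^2$ given by the following limit, which exists and is
finite:

$$\sigma^2 = \lim_{n\to\infty} \frac{1}{n} \mathrm{Var}\left(\log\left[\prod_{i=1}^n \frac{Q(X_i|X_{i-1})}{\pi(X_i)}\right]\right). \tag{6}$$

(ii) If the random variables $\{X_n\}$ are independent, equivalently, if $I_\pi(X_0;X_1) = 0$, then,

$$2n\,\hat{I}_n(P_{X_0X_1}) \xrightarrow{D} \chi^2\big((m-1)^2\big), \qquad \text{as } n \to \infty,$$

where $\chi^2(s)$ denotes the $\chi^2$ distribution with s degrees of freedom, and $m = |A|$ is the size
of the alphabet.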


Remarks. 

1. As will become evident from the proof, the restriction that all the transition probabilities
$Q(a'|a)$ of the chain $\{X_n\}$ are positive is unnecessary. Indeed, for the result of part (i)
it can be entirely removed, and replaced with the minimal assumption that $\{X_n\}$ is
irreducible and aperiodic. Similarly, for part (ii) the positivity assumption can be significantly
relaxed. For example, Theorem 5.2 of [5] gives weaker conditions under which the same
conclusions can be obtained, e.g., if we restrict attention to a class of ergodic chains whose
transition matrices are allowed to contain zero probabilities, but the zeros always occur at
the same state transitions.


2. One of the main messages of Theorem 2.1 is the clear dichotomy between independence
and dependence: If the random variables $\{X_n\}$ are independent, then $I_\pi(X_0;X_1) = 0$ and
the plug-in estimator $\hat{I}_n(P_{X_0X_1})$ converges at a rate $O(1/n)$. On the other hand, if the
$\{X_n\}$ are not independent, then $I_\pi(X_0;X_1)$ is strictly positive and the plug-in estimator
$\hat{I}_n(P_{X_0X_1})$ converges at the slower rate $O(1/\sqrt{n})$.

There is a minor caveat in the above syllogism, in that it is only valid as long as $\sigma^2$
is strictly positive; when $\sigma^2 = 0$, then even if the $\{X_n\}$ are not independent, the
plug-in estimator $\hat{I}_n(P_{X_0X_1})$ converges at a rate faster than $O(1/\sqrt{n})$. But it is easy to see,
intuitively, that $\sigma^2$ is typically nonzero when $I_\pi(X_0;X_1)$ is positive. In the special case of
chains with a uniform stationary distribution, this is illustrated by Proposition 2.2 below,
proved in Appendix A.2.
3. As a consequence of the proof of Theorem 2.1, it is fairly simple to determine the exact
a.s. rate of convergence of the plug-in estimator under very general conditions. This is
stated in Corollary 2.3 below.

4. For the proof of the second part of the theorem we will exploit a connection between the
problem of estimating the mutual information $I_\pi(X_0;X_1)$ and a classical hypothesis test
for independence, as outlined in the following section.

Proposition 2.2 Suppose the stationary distribution $\pi$ of the chain $\{X_n\}$ is uniform on A or,
equivalently, that the transition matrix $(Q(a'|a))$ is doubly stochastic. Then the variance $\sigma^2$
defined in (6) is zero if and only if the $\{X_n\}$ are i.i.d. and each $X_n$ is uniformly distributed
on A.

The final result of this section gives the exact pointwise rate of convergence for the plug-in
estimator; it is established in Appendix A.3.

Corollary 2.3 Let $X = \{X_n\}$ be an irreducible and aperiodic Markov chain on the alphabet A,
with an arbitrary initial distribution. Then, as $n \to \infty$, the plug-in estimator satisfies,

$$\hat{I}_n(P_{X_0X_1}) = I_\pi(X_0;X_1) + O\left(\sqrt{\frac{\log\log n}{n}}\right) \qquad \text{a.s.}$$

2.3 A hypothesis test for independence 

Suppose we wish to test the null hypothesis that the random variables $\{X_n\}$ are independent,
within the larger hypothesis that $\{X_n\}$ is a Markov chain with all positive transitions. Take,
without loss of generality, the alphabet to be $A = \{1, 2,\ldots,m\}$, where $m = |A|$. Then, we can
parametrize all possible transition matrices $Q = Q_\theta$ with all-positive transition probabilities,
by an $m(m-1)$-dimensional vector $\theta$ restricted to an open set $\Theta \subset \mathbb{R}^{m(m-1)}$. Similarly, the null
hypothesis is specified by a lower-dimensional open set $\Phi \subset \mathbb{R}^{m-1}$ which is naturally embedded
within $\Theta$ via a map $h : \Phi \to \Theta$. The details of the parametrization and the embedding are
given in the proof of Theorem 2.1 in Appendix A.1 but, informally, $\Phi$ indexes those transition
matrices $Q_{h(\phi)}$ that consist of m identical rows, exactly corresponding to those Markov chains
that consist of independent random variables $\{X_n\}$.

In order to test the (composite) null hypothesis $\Phi$ within the general model $\Theta$, following
classical statistical theory we employ a maximum-likelihood ratio test. Specifically, if we define
the log-likelihood $L_n(X_0^n;\theta)$ of the sample $X_0^n$ under the distribution corresponding to $\theta$ as,

$$L_n(X_0^n;\theta) = \log\big[\mathrm{Pr}_\theta(X_1^n \mid X_0)\big] = \log\Big(\prod_{i=1}^n Q_\theta(X_i|X_{i-1})\Big), \tag{7}$$

then the likelihood ratio test statistic is simply the difference,

$$\Delta_n = 2\Big\{\max_{\theta\in\Theta} L_n(X_0^n;\theta) - \max_{\phi\in\Phi} L_n(X_0^n;h(\phi))\Big\}. \tag{8}$$

In terms of hypothesis testing, there are two important observations to be made here. The
first is that this statistic is exactly equal to $2n$ times the plug-in estimator $\hat{I}_n(P_{X_0X_1})$; the
computation showing this is performed in Appendix A.4.
Proposition 2.4 Under the assumptions of Theorem 2.1 and in the notation of this section:

$$\Delta_n = 2n\,\hat{I}_n(P_{X_0X_1}).$$
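The identity in Proposition 2.4 can be checked numerically by maximizing the log-likelihood (7) in closed form. The sketch below assumes the standard maximum-likelihood estimates, namely $Q(a'|a) = N(a,a')/N(a)$ in the full model and a common row $q(a') = N(a')/n$ under the null, where $N(\cdot)$ are counts over the $n$ observed transitions; when some transition is never observed these maximizers lie on the boundary of the open parameter set, so the maxima are then suprema. All function names are hypothetical.

```python
import math
from collections import Counter

def max_loglik_full(sample):
    """Supremum of (7) over all transition matrices: Q(a'|a) = N(a,a') / N(a)."""
    pairs = Counter(zip(sample[:-1], sample[1:]))
    first = Counter(sample[:-1])
    return sum(c * math.log(c / first[a]) for (a, _), c in pairs.items())

def max_loglik_null(sample):
    """Supremum of (7) over matrices with identical rows: q(a') = N(a') / n."""
    n = len(sample) - 1
    second = Counter(sample[1:])
    return sum(c * math.log(c / n) for c in second.values())

def likelihood_ratio(sample):
    """The statistic in (8)."""
    return 2.0 * (max_loglik_full(sample) - max_loglik_null(sample))

def plugin_mi(sample):
    """Mutual information of the empirical distribution of consecutive pairs."""
    n = len(sample) - 1
    pairs = Counter(zip(sample[:-1], sample[1:]))
    first, second = Counter(sample[:-1]), Counter(sample[1:])
    return sum((c / n) * math.log(c * n / (first[a] * second[b]))
               for (a, b), c in pairs.items())
```

For any sample, `likelihood_ratio(sample)` and `2 * (len(sample) - 1) * plugin_mi(sample)` agree up to rounding error.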

The second important thing to note is that, under the null hypothesis, that is, assuming that
the random variables $\{X_n\}$ are independent, part (ii) of Theorem 2.1 tells us that the distribution
of $\Delta_n = 2n\hat{I}_n(P_{X_0X_1})$ is approximately $\chi^2((m-1)^2)$, which does not depend on the distribution
of the data, except only through the alphabet size m. Therefore, we can decide whether or not
the data $X_0^n$ offer strong enough evidence to reject the null hypothesis by examining the value
of $\Delta_n = 2n\hat{I}_n(P_{X_0X_1})$ and then computing a p-value based on this distribution.
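The test just described can be sketched in a few lines. The code below is an illustration under stated assumptions rather than a reference implementation: the function names are hypothetical, the alphabet size `m` is supplied by the caller, and the $\chi^2$ tail probability is computed with the standard recurrence in the degrees of freedom, started from `erfc` (odd case) or `exp` (even case), so that only the Python standard library is needed.

```python
import math
from collections import Counter

def chi2_sf(x, dof):
    """P(chi-square with integer dof >= 1 exceeds x).

    Uses Q(x|1) = erfc(sqrt(x/2)), Q(x|2) = exp(-x/2) and the recurrence
    Q(x|v+2) = Q(x|v) + (x/2)^(v/2) * exp(-x/2) / Gamma(v/2 + 1).
    """
    if x <= 0:
        return 1.0
    half = x / 2.0
    if dof % 2 == 0:
        q, nu = math.exp(-half), 2
    else:
        q, nu = math.erfc(math.sqrt(half)), 1
    while nu < dof:
        q += math.exp((nu / 2.0) * math.log(half) - half - math.lgamma(nu / 2.0 + 1.0))
        nu += 2
    return q

def independence_test(sample, m):
    """Return (statistic 2n * plug-in MI, asymptotic p-value) for X_0, ..., X_n."""
    n = len(sample) - 1
    pairs = Counter(zip(sample[:-1], sample[1:]))
    first, second = Counter(sample[:-1]), Counter(sample[1:])
    mi = sum((c / n) * math.log(c * n / (first[a] * second[b]))
             for (a, b), c in pairs.items())
    delta = 2.0 * n * mi
    return delta, chi2_sf(delta, (m - 1) ** 2)
```

A small p-value is evidence against independence. Since the $\chi^2$ approximation is asymptotic, the p-value should be read with caution when n is small relative to $m^2$.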

Conversely, as we shall see in the proof of part (ii) of Theorem 2.1, the asymptotic properties of
the estimator $\hat{I}_n(P_{X_0X_1})$ can be deduced from general results about the likelihood ratio $\Delta_n$.


3 Directed information 

3.1 The directed information rate of Markov chains 

Let $X = \{X_n\}$ and $Y = \{Y_n\}$ be two arbitrary processes with values in the finite alphabets A
and B, respectively. Recall that the directed information between $X_1^n$ and $Y_1^n$ is,

$$I(X_1^n \to Y_1^n) = H(Y_1^n) - \sum_{i=1}^n H(Y_i \mid Y_1^{i-1}, X_1^i),$$

and that it is zero exactly when each $Y_i$ is conditionally independent of $X_1^i$, given its past
$Y_1^{i-1}$. The natural interpretation of this equivalence is to say that the directed information
is zero if and only if X has no causal influence on Y. We are interested in the problem of
estimating the directed information rate between X and Y, defined as the limit, $I(X \to Y) =
\lim_{n\to\infty}(1/n)I(X_1^n \to Y_1^n)$, whenever it exists.

If the pairs $\{(X_n, Y_n)\}$ are independent and identically distributed, then it is easy to see that
$I(X \to Y)$ simplifies to $I(X_1; Y_1)$, and the problem of estimating it reduces to that discussed in
the Introduction. Of course, in this case there is nothing to discover regarding causal dependence.

From now on we assume that the pair process $\{(X_n,Y_n) \,;\, n \geq -k+1\}$ is an ergodic
(namely, irreducible and aperiodic) Markov chain on the product alphabet $A \times B$, of memory
length $k \geq 1$, and with an arbitrary initial distribution for $(X_{-k+1}^0, Y_{-k+1}^0)$. We write $\{(\bar{X}_n, \bar{Y}_n)\}$
for the stationary version of $\{(X_n, Y_n)\}$, with $(\bar{X}_{-k+1}^0, \bar{Y}_{-k+1}^0)$ distributed according to the unique
invariant measure of the bivariate chain, and recall that the distribution of $\{(\bar{X}_n,\bar{Y}_n)\}$ can be
extended so that it is defined for all $n = \ldots, -1, 0, 1, \ldots$.

In this case, the following proposition shows that, under appropriate conditions, the directed 
information rate can be expressed in simpler form. 

Proposition 3.1 Suppose $\{(X_n,Y_n)\}$ is an irreducible and aperiodic Markov chain of order no
larger than k, with an arbitrary initial distribution. Then:

(i) The entropy rate $H(Y)$ of the univariate process $Y = \{Y_n\}$ exists and,

$$H(Y) = \lim_{n\to\infty} \frac{1}{n} H(Y_1^n) = \lim_{n\to\infty} \frac{1}{n} H(\bar{Y}_1^n).$$

(ii) The directed information rate $I(X \to Y)$ exists and it equals,

$$I(X \to Y) = \lim_{n\to\infty} \frac{1}{n} I(X_1^n \to Y_1^n) = H(Y) - H(\bar{Y}_0 \mid \bar{X}_{-k}^0, \bar{Y}_{-k}^{-1}).$$

(iii) If $Y = \{Y_n\}$ is also a Markov chain of order no larger than k, then $I(X \to Y)$ further
simplifies to,

$$I(X \to Y) = I(\bar{Y}_0; \bar{X}_{-k}^0 \mid \bar{Y}_{-k}^{-1}).$$
Remarks. 

1. Throughout this section we assume that $\{(X_n,Y_n)\}$ is a Markov chain, not necessarily
stationary (i.e., with an arbitrary initial distribution), with memory no larger than some
fixed k. For the sake of technical convenience we will also assume that $\{(X_n,Y_n)\}$ has a
strictly positive transition matrix Q,

$$Q(a_k,b_k \mid a_0^{k-1}, b_0^{k-1}) = \Pr\{X_n = a_k, Y_n = b_k \mid X_{n-k}^{n-1} = a_0^{k-1}, Y_{n-k}^{n-1} = b_0^{k-1}\} > 0,$$

for all $a_0^k \in A^{k+1}$, $b_0^k \in B^{k+1}$. As discussed in the remarks following Theorem 3.3, this
assumption can be significantly relaxed.

2. Like mutual information, the directed information rate $I(X \to Y)$ also admits important
operational interpretations. For example, in the case of a stationary kth order Markov
chain $\{(\bar{X}_n,\bar{Y}_n)\}$ such that $\{\bar{Y}_n\}$ is also a kth order chain, we can use the data processing
property of mutual information in the result of part (iii) of the proposition to see that,

$$I(X \to Y) = I(\bar{Y}_0; \bar{X}_{-k}^0 \mid \bar{Y}_{-k}^{-1}) = I(\bar{Y}_0; \bar{X}_{-\infty}^0 \mid \bar{Y}_{-\infty}^{-1}).$$

This quantity is zero if and only if each $\bar{Y}_i$, given its past $\bar{Y}_{-\infty}^{i-1}$, is conditionally independent
of $\bar{X}_{-\infty}^i$, confirming our original intuition that the directed information is only zero in the
absence of causal influence.

3. In the case of a general stationary chain $\{(\bar{X}_n, \bar{Y}_n)\}$, without assuming anything else about
the process $\{\bar{Y}_n\}$, data processing still implies that,

$$I(\bar{Y}_0; \bar{X}_{-k}^0 \mid \bar{Y}_{-k}^{-1}) = I(\bar{Y}_0; \bar{X}_{-\infty}^0 \mid \bar{Y}_{-k}^{-1}) \geq I(\bar{Y}_0; \bar{X}_{-\infty}^0 \mid \bar{Y}_{-\infty}^{-1}).$$

The left-hand side is zero if and only if $\bar{Y}_0$, given only its k-past $\bar{Y}_{-k}^{-1}$, is conditionally independent of
$\bar{X}_{-k}^0$. In this case the quantity $I(\bar{Y}_0; \bar{X}_{-k}^0 \mid \bar{Y}_{-k}^{-1})$ is not enough to entirely characterize the
absence of causal influence from X to Y, but knowing its value nevertheless offers some
evidence for such an influence. In particular, knowing that it is zero (or sufficiently close
to zero), would still imply that X has no (or little) causal influence on Y.

Therefore, even if Y is not necessarily Markovian, it is always of interest to estimate
$I(\bar{Y}_0; \bar{X}_{-k}^0 \mid \bar{Y}_{-k}^{-1})$. Indeed, as we explain in detail in Section 3.3, this estimation problem is
intimately related to a likelihood-ratio hypothesis test for the presence of causal influence.
3.2 The plug-in estimator of directed information rate

Given a sample $(X_{-k+1}^n, Y_{-k+1}^n)$ from the joint process $\{(X_n,Y_n)\}$, we define the
$(k+1)$-dimensional, bivariate empirical distribution induced on $A^{k+1} \times B^{k+1}$ as,

$$\hat{P}_{X_{-k}^0 Y_{-k}^0,n}(a_0^k, b_0^k) = \frac{1}{n}\sum_{i=1}^n \mathbb{I}_{\{X_{i-k}^i = a_0^k,\; Y_{i-k}^i = b_0^k\}}, \qquad a_0^k \in A^{k+1},\; b_0^k \in B^{k+1}. \tag{9}$$


Motivated by the discussion in the remarks following Proposition 3.1, we now define the plug-in
estimator for the directed information rate $I(X \to Y)$ as,

$$\hat{I}_n^{(k)}(X \to Y) = I(\hat{Y}_0; \hat{X}_{-k}^0 \mid \hat{Y}_{-k}^{-1}), \qquad \text{where } (\hat{X}_{-k}^0, \hat{Y}_{-k}^0) \sim \hat{P}_{X_{-k}^0 Y_{-k}^0,n}. \tag{10}$$

Since all the transition probabilities of the bivariate chain $\{(X_n, Y_n)\}$ are nonzero, the
$(k+1)$-dimensional chain $\{Z_n = (X_{n-k}^n, Y_{n-k}^n)\}$ is ergodic, so the ergodic theorem implies that the
empirical distributions $\hat{P}_{X_{-k}^0 Y_{-k}^0,n}$ converge a.s., as $n \to \infty$, to $P_{\bar{X}_{-k}^0 \bar{Y}_{-k}^0}$. And hence, the plug-in
estimator $\hat{I}_n^{(k)}(X \to Y)$ also converges a.s. to the desired value, $I(\bar{Y}_0; \bar{X}_{-k}^0 \mid \bar{Y}_{-k}^{-1})$. The following
result describes its finer asymptotic behavior.
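A minimal sketch of the plug-in estimator (10), assuming the two processes are given as equal-length Python lists covering times $-k+1,\ldots,n$. The function name `plugin_directed_information` is hypothetical. The conditional mutual information of the empirical distribution (9) is evaluated directly from window counts, using $I(Y_0;X_{-k}^0|Y_{-k}^{-1}) = \sum p \log\big[p(x,y',y_0)\,p(y') / \big(p(x,y')\,p(y',y_0)\big)\big]$, where $x$ is the X-window, $y'$ the k-past of Y and $y_0$ its current value.

```python
import math
from collections import Counter

def plugin_directed_information(xs, ys, k):
    """Plug-in estimate (in nats) of the conditional mutual information in (10).

    xs, ys: equal-length sequences holding the samples at times -k+1, ..., n,
    so that there are n = len(xs) - k windows of length k + 1.
    """
    n = len(xs) - k
    full, x_ypast, y_win, ypast = Counter(), Counter(), Counter(), Counter()
    for i in range(k, len(xs)):
        xw = tuple(xs[i - k:i + 1])   # X-window, including the current X
        yp = tuple(ys[i - k:i])       # k-past of Y
        y0 = ys[i]                    # current Y
        full[(xw, yp, y0)] += 1
        x_ypast[(xw, yp)] += 1
        y_win[(yp, y0)] += 1
        ypast[yp] += 1
    # Factors of 1/n cancel inside the logarithm, leaving ratios of counts.
    return sum((c / n) * math.log(c * ypast[yp] / (x_ypast[(xw, yp)] * y_win[(yp, y0)]))
               for (xw, yp, y0), c in full.items())
```

If `ys` is constant the estimate is exactly zero, and if `ys` copies `xs` the estimate reduces to the empirical conditional entropy of the current Y given its k-past.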


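Theorem 3.2 Let $\{(X_n,Y_n)\}$ be a Markov chain of memory length $k \geq 1$, with an all positive
transition matrix Q on the finite alphabet $A \times B$, and with an arbitrary initial distribution.
Assume that the univariate process $\{Y_n\}$ is also a Markov chain with memory length k.

(i) If the random variables $\{X_n\}$ do have a causal influence on the $\{Y_n\}$, equivalently, if
$I(X \to Y) > 0$, then,

$$\sqrt{n}\,\big[\hat{I}_n^{(k)}(X \to Y) - I(X \to Y)\big] \xrightarrow{D} N(0,\sigma^2), \qquad \text{as } n \to \infty,$$

where the variance $\sigma^2$ is given by the following limit, which exists and is finite:

$$\sigma^2 = \lim_{n\to\infty} \frac{1}{n} \mathrm{Var}\left\{\log\left[\prod_{i=1}^n \frac{P_{\bar{Y}_0|\bar{X}_{-k}^0,\bar{Y}_{-k}^{-1}}(Y_i \mid X_{i-k}^i, Y_{i-k}^{i-1})}{P_{\bar{Y}_0|\bar{Y}_{-k}^{-1}}(Y_i \mid Y_{i-k}^{i-1})}\right]\right\}. \tag{11}$$

(ii) If the random variables $\{X_n\}$ do not have a causal influence on the $\{Y_n\}$, equivalently, if
$I(X \to Y) = 0$, then,

$$2n\,\hat{I}_n^{(k)}(X \to Y) \xrightarrow{D} \chi^2\big(\ell^k(m^{k+1}-1)(\ell-1)\big), \qquad \text{as } n \to \infty, \tag{12}$$

where $m = |A|$ and $\ell = |B|$ are the sizes of the alphabets A, B, respectively.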

Theorem 3.2 is an immediate consequence of the following more general result that does not
assume that Y is a Markov chain, combined with Proposition 3.1. Theorem 3.3 is proved in
Appendix A.6.
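Part (ii) of Theorem 3.2 suggests how an asymptotic p-value for the absence of causal influence could be computed from the plug-in estimate. The sketch below assumes that the degrees of freedom in (12) are $\ell^k(m^{k+1}-1)(\ell-1)$, the difference between the numbers of free parameters in the conditional distribution of the current Y with and without the conditional independence constraint. This count is always even for $k \geq 1$, because it contains the factor $\ell(\ell-1)$, so the $\chi^2$ tail has a closed form as a finite sum. All function names are hypothetical.

```python
import math

def causality_dof(m, l, k):
    """Assumed degrees of freedom in (12): l^k * (m^(k+1) - 1) * (l - 1)."""
    return l ** k * (m ** (k + 1) - 1) * (l - 1)

def chi2_sf_even(x, dof):
    """P(chi-square with even dof exceeds x) = exp(-x/2) * sum_{j < dof/2} (x/2)^j / j!."""
    if dof % 2 != 0:
        raise ValueError("closed form implemented for even degrees of freedom only")
    if x <= 0:
        return 1.0
    half = x / 2.0
    term, total = 1.0, 1.0
    for j in range(1, dof // 2):
        term *= half / j      # term now equals (x/2)^j / j!
        total += term
    return math.exp(-half) * total

def causality_p_value(estimate, n, m, l, k):
    """Asymptotic p-value of the statistic 2n * estimate under the null in (12)."""
    return chi2_sf_even(2.0 * n * estimate, causality_dof(m, l, k))
```

For binary alphabets and $k = 1$ the statistic is compared with a $\chi^2$ distribution with 6 degrees of freedom, so an estimate of about $12.59/(2n)$ sits at the 5% level.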
Theorem 3.3 If $\{(X_n,Y_n)\}$ is a Markov chain of memory length $k \geq 1$, with an all positive
transition matrix Q on the finite alphabet $A \times B$, and with an arbitrary initial distribution, then:

(i) If $I(\bar{Y}_0; \bar{X}_{-k}^0 \mid \bar{Y}_{-k}^{-1})$ is nonzero, then with $\sigma^2$ as in (11):

$$\sqrt{n}\,\big[\hat{I}_n^{(k)}(X \to Y) - I(\bar{Y}_0; \bar{X}_{-k}^0 \mid \bar{Y}_{-k}^{-1})\big] \xrightarrow{D} N(0,\sigma^2), \qquad \text{as } n \to \infty.$$

(ii) If, on the other hand, $I(\bar{Y}_0; \bar{X}_{-k}^0 \mid \bar{Y}_{-k}^{-1}) = 0$, then the plug-in estimator converges to a
$\chi^2$ distribution exactly as in (12).


Remarks. 

1. From the proof of Theorem 3.3 it is evident that the restriction of all-positive transition probabilities $Q(a_k,b_k \mid a_0^{k-1},b_0^{k-1})$ for the chain $\{(X_n,Y_n)\}$ is unnecessary. The result of part (i) remains valid with this restriction replaced by the minimal assumption that the pair process $\{(X_n,Y_n)\}$ is irreducible and aperiodic. And for part (ii) the positivity assumption can also be significantly relaxed, in accordance with the discussion around Theorem 5.2 of [5], particularly as long as the $k$-dimensional version of the assumptions in Condition 5.1 is satisfied, as discussed in Remark 1 after Theorem 2.1 earlier.

2. An important consequence of Theorems 3.2 and 3.3 is the clear dichotomy between the presence and absence of causal influence: If the $\{X_n\}$ have no causal influence on the $\{Y_n\}$, then $I(X \to Y) = 0$ and the plug-in estimator converges at a rate $O(1/n)$. On the other hand, if such causal influence does exist, then the directed information rate $I(X \to Y)$ is strictly positive and the plug-in estimator converges at the slower rate $O(1/\sqrt{n})$.

3. An examination of the proof of Theorem 3.3 shows that, with some additional effort, it can be refined to provide very accurate results on the a.s. and $L^1$ rates of convergence of the plug-in estimator, under very general conditions; see Corollary 3.4 below, proved in Appendix A.7. In particular, in view of the converse result in [18, Proposition 3], the asymptotic bound in (14) implies that the rate at which the plug-in converges is optimal.

Although it is easily established that an analogous result holds for the plug-in estimator of mutual information, the corresponding proof is merely a simplification of the (already fairly straightforward) proof of Corollary 3.4 and therefore it was not included in the previous section.

4. For the proof of the $\chi^2$-convergence part of the theorem we will exploit an interesting connection of this problem with a classical hypothesis test for conditional independence; this is discussed in detail in Section 3.3.


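Corollary 3.4 Let $\{(X_n,Y_n)\}$ be an irreducible and aperiodic Markov chain on $A \times B$, of memory length $k \ge 1$, and with an arbitrary initial distribution. Then, as $n \to \infty$, the plug-in estimator satisfies,

\begin{align*}
\hat I_n^{(k)}(X \to Y) - I(Y_0; X_{-k}^0 \mid Y_{-k}^{-1}) &= O\left(\sqrt{\frac{\log\log n}{n}}\right) \quad \text{a.s.}, \tag{13}\\
E\left|\hat I_n^{(k)}(X \to Y) - I(Y_0; X_{-k}^0 \mid Y_{-k}^{-1})\right| &= O\left(\frac{1}{\sqrt{n}}\right). \tag{14}
\end{align*}

If $Y$ is also a Markov chain of memory no larger than $k$, then, as $n \to \infty$,

\begin{align*}
\hat I_n^{(k)}(X \to Y) - I(X \to Y) &= O\left(\sqrt{\frac{\log\log n}{n}}\right) \quad \text{a.s.}, \\
E\left|\hat I_n^{(k)}(X \to Y) - I(X \to Y)\right| &= O\left(\frac{1}{\sqrt{n}}\right). \tag{15}
\end{align*}

In view of [18, Proposition 3], the convergence rate established in (15) above is optimal.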


3.3 A hypothesis test for causal influence 

Suppose we wish to test whether or not the samples $\{X_n\}$ have a causal influence on the $\{Y_n\}$. As discussed already, in the present context this translates to testing the null hypothesis that each random variable $Y_i$ is conditionally independent of $X_{i-k}^i$ given $Y_{i-k}^{i-1}$, within the larger hypothesis that the pair process $\{(X_n,Y_n)\}$ is a $k$th order Markov chain on $A \times B$ with all positive transitions. We take, without loss of generality, the alphabets of $X$ and $Y$ to be $A = \{1,2,\ldots,m\}$ and $B = \{1,2,\ldots,\ell\}$, respectively.

As we describe in detail in the proof of Theorem 3.3 in Appendix A.6, each positive transition matrix $Q = Q_\theta$ is indexed by a parameter vector $\theta$ taking values in an $m^k\ell^k(m\ell-1)$-dimensional open set $\Theta$. And the null hypothesis, corresponding to each random variable $Y_i$ being conditionally independent of $X_{i-k}^i$ given $Y_{i-k}^{i-1}$, is described by transition matrices $Q_\theta$ which can be decomposed as,

$$Q_\theta(a_0,b_0 \mid a_{-k}^{-1},b_{-k}^{-1}) = Q_\theta^x(a_0 \mid a_{-k}^{-1},b_{-k}^{-1})\,Q_\theta^y(b_0 \mid b_{-k}^{-1}). \tag{16}$$

This is formally described by a lower-dimensional parameter set $\Phi$, which can be embedded in $\Theta$ via a map $h : \Phi \to \Theta$, such that all induced transition matrices $Q_{h(\phi)}$ correspond to Markov chains that satisfy the required conditional independence property (16).

In order to test the null hypothesis $\Phi$ within the general model $\Theta$, we employ a likelihood ratio test. Specifically, we define the log-likelihood $L_n(X_{-k+1}^n,Y_{-k+1}^n;\theta)$ of the sample $(X_{-k+1}^n,Y_{-k+1}^n)$ under the distribution corresponding to $\theta$ as,

$$L_n(X_{-k+1}^n,Y_{-k+1}^n;\theta) = \log\big[\Pr\nolimits_\theta(X_1^n,Y_1^n \mid X_{-k+1}^0,Y_{-k+1}^0)\big] = \log\left(\prod_{i=1}^n Q_\theta(X_i,Y_i \mid X_{i-k}^{i-1},Y_{i-k}^{i-1})\right), \tag{17}$$

so that the likelihood ratio test statistic is simply the difference,

$$\Lambda_n = 2\left\{\max_{\theta\in\Theta} L_n(X_{-k+1}^n,Y_{-k+1}^n;\theta) - \max_{\phi\in\Phi} L_n(X_{-k+1}^n,Y_{-k+1}^n;h(\phi))\right\}. \tag{18}$$

As in the earlier case discussed in Section 2, there are two key observations to be made here. First, this statistic is exactly equal to $2n$ times the plug-in estimator. Proposition 3.5 is proved in Appendix A.8.

Proposition 3.5 Under the assumptions of Theorem 3.3 and in the notation of this section:

$$\Lambda_n = 2n\,\hat I_n^{(k)}(X \to Y).$$
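As a rough illustration of Proposition 3.5, the following sketch computes the plug-in estimate as a difference of empirical entropies of blocks of length $k+1$, and then the statistic $\Lambda_n$ as $2n$ times that estimate. The function names and the convention that the first $k$ sample pairs serve as the initial context are assumptions made here for illustration; they do not come from the text.

```python
from collections import Counter
from math import log


def _entropy(counts, n):
    """Entropy (in nats) of the empirical distribution with the given counts."""
    return -sum((c / n) * log(c / n) for c in counts.values())


def plug_in_directed_information(xs, ys, k):
    """Plug-in estimate of I(Y_0; X_{-k}^0 | Y_{-k}^{-1}).

    xs, ys: equal-length sequences of hashable symbols. The first k pairs act
    as the initial context (an assumption of this sketch), which leaves
    n = len(xs) - k blocks of length k + 1.
    """
    if len(xs) != len(ys) or k < 1 or len(xs) <= k:
        raise ValueError("need equal-length sequences longer than k >= 1")
    n = len(xs) - k
    c_xy, c_y, c_x_ypast, c_ypast = Counter(), Counter(), Counter(), Counter()
    for i in range(k, len(xs)):
        x_block = tuple(xs[i - k:i + 1])  # X_{i-k}^{i}
        y_block = tuple(ys[i - k:i + 1])  # Y_{i-k}^{i}
        y_past = y_block[:-1]             # Y_{i-k}^{i-1}
        c_xy[(x_block, y_block)] += 1
        c_y[y_block] += 1
        c_x_ypast[(x_block, y_past)] += 1
        c_ypast[y_past] += 1
    # H(Y_0 | Y_past) - H(Y_0 | X_{-k}^0, Y_past), under the empirical law
    return (_entropy(c_y, n) - _entropy(c_ypast, n)
            - _entropy(c_xy, n) + _entropy(c_x_ypast, n))


def likelihood_ratio_statistic(xs, ys, k):
    """Lambda_n computed as 2n times the plug-in estimate."""
    n = len(xs) - k
    return 2 * n * plug_in_directed_information(xs, ys, k)
```

When `ys` is an exact copy of a sequence `xs` whose next symbol is not determined by its previous one, the estimate equals the empirical conditional entropy of $Y_0$ given its past; when `xs` is constant, the estimate is zero.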
Recall that, under the null hypothesis, part (ii) of Theorem 3.3 tells us that the distribution of $\Lambda_n$ is approximately $\chi^2$ with $\ell^k(m^{k+1}-1)(\ell-1)$ degrees of freedom. The second important thing to note is that this limiting distribution does not depend on the actual distribution of the samples, except through the alphabet sizes $m,\ell$ and the memory length $k$. Therefore, following standard statistical methodology, we can decide whether or not the data offer strong enough evidence to reject the null hypothesis by examining the value of $\Lambda_n$: If the p-value given by the probability of the tail $[\Lambda_n,\infty)$ of the asymptotic $\chi^2$ distribution is below a certain threshold $\alpha$, then the null hypothesis (absence of causal influence) can be rejected at the significance level $\alpha$.

Conversely, as we discuss in the proof of part (ii) of Theorem 3.3, the asymptotic distribution of the plug-in estimator under the null hypothesis follows from the corresponding general results about the likelihood ratio in [5].
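The decision rule just described can be sketched as follows. This is only an illustrative sketch: the function names are hypothetical, the default threshold `alpha=0.05` is an arbitrary choice, and the $\chi^2$ tail probability is computed with a textbook series and continued-fraction evaluation of the regularized incomplete gamma function rather than with any particular statistics library.

```python
import math


def _upper_gamma_q(a, x):
    """Regularized upper incomplete gamma function Q(a, x), a > 0, x >= 0."""
    if x <= 0.0:
        return 1.0
    log_pref = -x + a * math.log(x) - math.lgamma(a)
    if x < a + 1.0:
        # Power series for the lower function P(a, x); return 1 - P.
        term = 1.0 / a
        total = term
        denom = a
        for _ in range(10000):
            denom += 1.0
            term *= x / denom
            total += term
            if abs(term) < abs(total) * 1e-15:
                break
        return 1.0 - total * math.exp(log_pref)
    # Continued fraction for Q(a, x), evaluated with the modified Lentz method.
    tiny = 1e-300
    b = x + 1.0 - a
    c = 1.0 / tiny
    d = 1.0 / b
    h = d
    for i in range(1, 10000):
        an = -i * (i - a)
        b += 2.0
        d = an * d + b
        if abs(d) < tiny:
            d = tiny
        c = b + an / c
        if abs(c) < tiny:
            c = tiny
        d = 1.0 / d
        delta = d * c
        h *= delta
        if abs(delta - 1.0) < 1e-15:
            break
    return h * math.exp(log_pref)


def chi2_survival(x, df):
    """P(chi-square with df degrees of freedom >= x)."""
    return _upper_gamma_q(df / 2.0, x / 2.0)


def degrees_of_freedom(m, ell, k):
    """Degrees of freedom ell^k (m^(k+1) - 1)(ell - 1) of the limiting law."""
    return ell ** k * (m ** (k + 1) - 1) * (ell - 1)


def causality_test(lambda_n, m, ell, k, alpha=0.05):
    """Return (p_value, reject_null) for an observed statistic lambda_n."""
    p_value = chi2_survival(lambda_n, degrees_of_freedom(m, ell, k))
    return p_value, p_value < alpha
```

For binary alphabets and $k = 1$ the limiting distribution has 6 degrees of freedom, so an observed value such as $\Lambda_n = 30$ leads to rejection at the 5% level, while $\Lambda_n = 1$ does not.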
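A Appendix


A.1 Proof of Theorem 2.1

For part (i), first, we express $\hat I_n(P_{X_0X_1}) = I(\hat P_{X_0X_1,n})$ as,

\begin{align*}
\hat I_n(P_{X_0X_1})
&= \sum_{a,a'} \hat P_{X_0X_1,n}(a,a') \log\left(\frac{\hat P_{X_0X_1,n}(a,a')}{\hat P_{X_0,n}(a)\,\hat P_{X_1,n}(a')}\right) \\
&= \sum_{a,a'} \hat P_{X_0X_1,n}(a,a') \log\left(\frac{Q(a'|a)}{\pi(a')}\right) + \sum_{a,a'} \hat P_{X_0X_1,n}(a,a') \log\left(\frac{\pi(a')}{Q(a'|a)}\cdot\frac{\hat P_{X_1|X_0,n}(a'|a)}{\hat P_{X_1,n}(a')}\right) \\
&= \sum_{a,a'} \left[\frac{1}{n}\sum_{i=1}^n 1\{X_{i-1}=a,\,X_i=a'\}\right] \log\left(\frac{Q(a'|a)}{\pi(a')}\right) \\
&\qquad + \sum_{a,a'} \hat P_{X_0X_1,n}(a,a') \log\left(\frac{\pi(a)\,\pi(a')}{\hat P_{X_0,n}(a)\,\hat P_{X_1,n}(a')}\cdot\frac{\hat P_{X_0X_1,n}(a,a')}{Q(a'|a)\,\pi(a)}\right) \\
&= \frac{1}{n}\sum_{i=1}^n \sum_{a,a'} 1\{X_{i-1}=a,\,X_i=a'\} \log\left(\frac{Q(a'|a)}{\pi(a')}\right) \\
&\qquad + D(\hat P_{X_0X_1,n}\,\|\,P_{X_0X_1}) - D(\hat P_{X_0,n}\,\|\,\pi) - D(\hat P_{X_1,n}\,\|\,\pi) \\
&= \frac{1}{n}\sum_{i=1}^n \log\left(\frac{Q(X_i|X_{i-1})}{\pi(X_i)}\right) + D(\hat P_{X_0X_1,n}\,\|\,P_{X_0X_1}) - D(\hat P_{X_0,n}\,\|\,\pi) - D(\hat P_{X_1,n}\,\|\,\pi), \tag{19}
\end{align*}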

where $D(P\|Q) = \sum_{x\in A} P(x)\log[P(x)/Q(x)]$ denotes the relative entropy between two discrete distributions $P$ and $Q$ on the same alphabet $A$. As mentioned earlier, the bivariate chain $\{Z_n = (X_{n-1},X_n)\,;\ n \ge 1\}$ on $A \times A$ is ergodic, therefore it satisfies the ergodic theorem, the central limit theorem, and the law of the iterated logarithm; see, e.g., [7]. In particular, the law of the iterated logarithm implies that each of the three relative entropies above, when multiplied by $\sqrt{n}$, converges to zero a.s., as $n \to \infty$. Indeed, a Taylor expansion shows that $-\sqrt{n}\,D(\hat P_{X_0,n}\|\pi)$ is equal to,

\begin{align*}
\sqrt{n}\sum_{a\in A} \hat P_{X_0,n}(a) \log\left[1 + \left(\frac{\pi(a)}{\hat P_{X_0,n}(a)} - 1\right)\right]
&= \sqrt{n}\sum_{a\in A} \hat P_{X_0,n}(a) \left[\left(\frac{\pi(a)}{\hat P_{X_0,n}(a)} - 1\right) - \frac{1}{2\,\xi_n(a)^2}\left(\frac{\pi(a)}{\hat P_{X_0,n}(a)} - 1\right)^2\right] \\
&= -\frac{\sqrt{n}}{2}\sum_{a\in A} \frac{1}{\hat P_{X_0,n}(a)\,\xi_n(a)^2}\,\big(\hat P_{X_0,n}(a) - \pi(a)\big)^2,
\end{align*}

for some (random) $\xi_n(a)$ between 1 and $\pi(a)/\hat P_{X_0,n}(a)$. By the ergodic theorem, $(1/\hat P_{X_0,n}(a))$ and $(1/\xi_n(a)^2)$ are both bounded a.s., and from the law of the iterated logarithm we have that $(\hat P_{X_0,n}(a) - \pi(a))\sqrt{n/(\log\log n)}$ is also bounded a.s. Therefore, writing the above expression as,

$$-\frac{\log\log n}{2\sqrt{n}}\sum_{a\in A} \frac{1}{\hat P_{X_0,n}(a)\,\xi_n(a)^2}\left[\big(\hat P_{X_0,n}(a) - \pi(a)\big)\sqrt{\frac{n}{\log\log n}}\,\right]^2, \tag{20}$$

we see that each term in the sum is a.s. bounded, and, therefore, the entire expression tends to zero, a.s., as $n \to \infty$, as required.

The same argument shows that, after multiplication by $\sqrt{n}$, the other two relative entropies in (19) also tend to zero, a.s., as $n \to \infty$. Therefore, a.s. as $n \to \infty$,

$$\sqrt{n}\,\big[\hat I_n(P_{X_0X_1}) - I_\pi(X_0;X_1)\big] = \frac{1}{\sqrt{n}}\sum_{i=1}^n \left[\log\left(\frac{Q(X_i|X_{i-1})}{\pi(X_i)}\right) - I_\pi(X_0;X_1)\right] + o(1), \tag{21}$$

and the result of part (i) follows immediately by an application of the central limit theorem [7, Sec. 1.16] to the bivariate chain $\{Z_n\}$. The existence of the limit in the definition of $\sigma^2$ is guaranteed by [7, Theorem 3, p. 97], and its finiteness follows easily from the fact that the alphabet is finite and the random variables $\{\log(Q(X_i|X_{i-1})/\pi(X_i))\}$ are uniformly bounded.

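For part (ii), first recall the hypothesis testing setup presented in Section 2. Each transition matrix $Q = Q_\theta$ on $A$ with positive transitions is indexed by a parameter $\theta$ in the following open subset of $\mathbb{R}^{m(m-1)}$:

$$\Theta = \Big\{\theta \in \mathbb{R}^{m(m-1)} \,:\, \theta_{ij} > 0 \text{ for all } i,j, \text{ and } \sum_{j} \theta_{ij} < 1 \text{ for each } i\Big\}.$$

The entries of the transition matrix $Q_\theta$ corresponding to a given parameter $\theta \in \Theta$ are,

$$Q_\theta(j|i) = \begin{cases} \theta_{ij}, & \text{if } j \neq m, \\ 1 - \sum_{j' \neq m} \theta_{ij'}, & \text{if } j = m. \end{cases} \tag{22}$$

Similarly, the null hypothesis is specified by the open set,

$$\Phi = \Big\{\phi \in \mathbb{R}^{m-1} \,:\, \phi_j > 0 \text{ for all } j, \text{ and } \sum_{j} \phi_j < 1\Big\},$$

which is naturally embedded within $\Theta$ via the map $h : \Phi \to \Theta$, where $\phi \mapsto \theta = h(\phi)$, with $\theta_{ij} = \phi_j$ for all $i,j$.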

Now, recall the result of Proposition 2.4, stating that our quantity of interest, $2n\hat I_n(P_{X_0X_1})$, is equal to the likelihood ratio statistic $\Lambda_n$ defined in (8), that is, to twice the log-likelihood ratio. Then, the claimed convergence of $2n\hat I_n(P_{X_0X_1})$ is exactly the result stated as (the last part of) the conclusion of [5, Theorem 5.2]. For that, we only need to verify the two assumptions of that theorem.

For the first assumption, we note that, in the notation of [5], the size $s$ of the alphabet is $s = m$, the dimensionality $r$ of $\Theta$ is $r = m(m-1)$, and the dimensionality $c$ of $\Phi$ is $c = m-1$, so that the limiting distribution of $\Lambda_n$ has $r - c = (m-1)^2$ degrees of freedom. For each $(i,j)$, $1 \le i \le m$, $1 \le j \le m-1$, the component of $h$ given by $h_{ij}(\phi) = \phi_j$ is certainly three times continuously differentiable with respect to each $\phi_\ell$, $1 \le \ell \le m-1$. Moreover, the $m(m-1) \times (m-1)$ matrix $K(\phi)$ with entries,

$$\frac{\partial h_{ij}(\phi)}{\partial \phi_\ell} = \frac{\partial \phi_j}{\partial \phi_\ell} = \delta_{j\ell},$$

where $\delta_{j\ell}$ equals 1 if $j = \ell$ and 0 otherwise, has rank $c = m-1$ throughout $\Phi$. This shows that the first assumption we needed, namely, Condition 3.1 on [5, p. 17], indeed is satisfied.

Similarly, for the second assumption let $D = A \times A$ and write $d = |D| = m^2$. For each $(i,j)$, $1 \le i,j \le m$, the component of the transition matrix $Q_\theta$ corresponding to $(i,j)$, $Q_\theta(j|i)$, is certainly three times continuously differentiable throughout $\Theta$. Moreover, recalling the parametrization (22), consider the $m^2 \times m(m-1)$ matrix $\mathcal{K}(\theta)$ with entries, for $1 \le i,j,i' \le m$, $1 \le k \le m-1$,

$$\frac{\partial Q_\theta(j|i)}{\partial \theta_{i'k}} = \begin{cases} \delta_{ii'}\,\delta_{jk}, & \text{if } j \neq m, \\ -\delta_{ii'}, & \text{if } j = m. \end{cases}$$

Clearly $\mathcal{K}(\theta)$ has rank $r = m(m-1)$ throughout $\Theta$, which shows that the second assumption we needed to check, namely Condition 5.1 on [5, p. 23], is also satisfied, thus completing the proof. □
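The two rank claims in the proof above can be checked mechanically for small alphabets. The sketch below builds both derivative matrices from the parametrization (22) and from $h_{ij}(\phi) = \phi_j$, and computes their ranks by exact Gaussian elimination over the rationals. The helper names are hypothetical, and indices are shifted to start at 0.

```python
from fractions import Fraction


def rank(rows):
    """Rank of a matrix (list of equal-length rows) by exact elimination."""
    mat = [[Fraction(v) for v in row] for row in rows]
    ncols = len(mat[0]) if mat else 0
    rk = 0
    for col in range(ncols):
        pivot = next((r for r in range(rk, len(mat)) if mat[r][col] != 0), None)
        if pivot is None:
            continue
        mat[rk], mat[pivot] = mat[pivot], mat[rk]
        for r in range(rk + 1, len(mat)):
            if mat[r][col] != 0:
                factor = mat[r][col] / mat[rk][col]
                mat[r] = [a - factor * b for a, b in zip(mat[r], mat[rk])]
        rk += 1
    return rk


def jacobian_h(m):
    """m(m-1) x (m-1) matrix with entries d h_ij / d phi_l = 1 if j == l else 0."""
    return [[1 if j == l else 0 for l in range(m - 1)]
            for i in range(m) for j in range(m - 1)]


def jacobian_q(m):
    """m^2 x m(m-1) matrix of d Q_theta(j|i) / d theta_{i'k} under (22)."""
    columns = [(ip, k) for ip in range(m) for k in range(m - 1)]
    rows = []
    for i in range(m):
        for j in range(m):
            row = []
            for ip, k in columns:
                if i != ip:
                    row.append(0)       # different conditioning symbol
                elif j == m - 1:
                    row.append(-1)      # last column of Q is 1 - sum of the rest
                else:
                    row.append(1 if j == k else 0)
            rows.append(row)
    return rows
```

For $m = 3$, for example, `rank(jacobian_h(3))` is $2 = m - 1$ and `rank(jacobian_q(3))` is $6 = m(m-1)$, in agreement with the values of $c$ and $r$ used in the proof.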


A.2 Proof of Proposition 2.2

This result is a consequence of [20, Theorem 3]; see also the discussion in jH]. To see that, observe that, under the assumptions of the proposition, $\pi(X_i) = 1/|A| = 1/m$ for all $i$ and therefore,

$$\sigma^2 = \lim_{n\to\infty} \frac{1}{n}\,\mathrm{Var}\left(\log \prod_{i=1}^n \big[m\,Q(X_i|X_{i-1})\big]\right),$$

which is equal to the variance $\sigma^2$ defined in [20, eq. (3.2)]. The stronger version of Theorem 3 established in the last section of [20] implies that $\sigma^2$ is only zero when there is a constant $q > 0$ and a positive vector $(v(a)\,;\ a \in A)$ such that,

$$Q(a'|a) = q\,\frac{v(a')}{v(a)}, \quad \text{for all } a,a' \in A.$$

But since all $Q(a'|a)$ are assumed to be strictly positive here, we can fix an arbitrary $a \in A$ and sum the above expression over all $a' \in A$ to obtain that $(q/v(a))\sum_{a'} v(a')$ should be equal to 1. Since $a$ was arbitrary, this means that the vector $v(a)$ is constant over $a$, and the result follows. □


A.3 Proof of Corollary 2.3


First we note that a careful examination of the proof of Theorem 2.1 part (i) shows that the entire argument remains valid for both cases $I_\pi(X_0;X_1) > 0$ and $I_\pi(X_0;X_1) = 0$, and also without the positivity assumption on the transition matrix, as long as the chain $X$ is irreducible and aperiodic, since it is a standard exercise to show that the bivariate chain $\{Z_n\}$ is still ergodic in this case.

In view of the above remarks, we observe that the computation leading to (20) implies that, as $n \to \infty$,

$$\sqrt{\frac{n}{\log\log n}}\;D(\hat P_{X_0,n}\,\|\,\pi) = O\left(\sqrt{\frac{\log\log n}{n}}\right) = o(1), \quad \text{a.s.},$$

and the same holds for each of the three relative entropies in (19). Therefore, (19) becomes,

$$\sqrt{\frac{n}{\log\log n}}\,\big[\hat I_n(P_{X_0X_1}) - I_\pi(X_0;X_1)\big] = \frac{1}{\sqrt{n\log\log n}}\sum_{i=1}^n \left[\log\left(\frac{Q(X_i|X_{i-1})}{\pi(X_i)}\right) - I_\pi(X_0;X_1)\right] + o(1),$$

a.s., as $n \to \infty$, and if $\sigma^2 > 0$ then an application of the law of the iterated logarithm [7] gives the claimed result. Finally, we note that, in view of the result in [26, Theorem 17.5.4], the same conclusion holds in the case $\sigma^2 = 0$. □


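A.4 Proof of Proposition 2.4

The first maximum in the definition of $\Lambda_n$, in view of (7), can be expressed as,

$$\max_{\theta\in\Theta} L_n(X_0^n;\theta) = \max_{\theta\in\Theta} \log\left(\prod_{i=1}^n Q_\theta(X_i|X_{i-1})\right) = \max_{Q} \sum_{i=1}^n \log\big(Q(X_i|X_{i-1})\big),$$

where the last maximum is over all transition matrices $Q$ with all positive entries. Therefore,

\begin{align*}
\max_{\theta\in\Theta} L_n(X_0^n;\theta)
&= \max_{Q} \sum_{a,a'} n\,\hat P_{X_0X_1,n}(a,a') \log\big(Q(a'|a)\big) \\
&= -n \min_{Q} \left\{ D\big(\hat P_{X_0X_1,n}\,\big\|\,Q \otimes \hat P_{X_0,n}\big) - \sum_{a,a'} \hat P_{X_0X_1,n}(a,a') \log\left(\frac{\hat P_{X_0X_1,n}(a,a')}{\hat P_{X_0,n}(a)}\right) \right\},
\end{align*}

where $Q \otimes \hat P_{X_0,n}$ denotes the bivariate distribution $(Q \otimes \hat P_{X_0,n})(a,a') = \hat P_{X_0,n}(a)\,Q(a'|a)$. Clearly the above expression is minimized when the relative entropy term is zero, namely, when $Q(a'|a) = \hat P_{X_0X_1,n}(a,a')/\hat P_{X_0,n}(a)$, so that,

$$\max_{\theta\in\Theta} L_n(X_0^n;\theta) = n \sum_{a,a'} \hat P_{X_0X_1,n}(a,a') \log\left(\frac{\hat P_{X_0X_1,n}(a,a')}{\hat P_{X_0,n}(a)}\right) = n\big[H(\hat P_{X_0,n}) - H(\hat P_{X_0X_1,n})\big].$$

A similar (and simpler) computation yields that the second maximum in (8) is,

$$\max_{\phi\in\Phi} L_n(X_0^n;h(\phi)) = -n\,H(\hat P_{X_1,n}).$$

Combining the last two equations with the definitions of $\Lambda_n$ and $\hat I_n(P_{X_0X_1})$ gives the claimed result. □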
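The identity established in this proof can be verified numerically on any sample path. In the sketch below (function names are hypothetical), the two maximized log-likelihoods are evaluated directly at their maximum-likelihood estimates, and twice their difference is compared with $2n$ times the plug-in mutual information between consecutive symbols.

```python
from collections import Counter
from math import log


def lr_statistic_direct(x):
    """Twice (max log-lik over Markov chains minus max log-lik under i.i.d.)."""
    n = len(x) - 1
    pairs = Counter(zip(x[:-1], x[1:]))   # counts of (X_{i-1}, X_i)
    first = Counter(x[:-1])               # counts of X_{i-1}
    second = Counter(x[1:])               # counts of X_i
    # Full model: the maximizer is Q(a'|a) = count(a, a') / count(a).
    full = sum(c * log(c / first[a]) for (a, _), c in pairs.items())
    # Null model: the maximizer is the empirical distribution of X_1, ..., X_n.
    null = sum(c * log(c / n) for c in second.values())
    return 2 * (full - null)


def plug_in_mutual_information(x):
    """Mutual information of the empirical distribution of (X_{i-1}, X_i)."""
    n = len(x) - 1
    pairs = Counter(zip(x[:-1], x[1:]))
    first = Counter(x[:-1])
    second = Counter(x[1:])
    return sum((c / n) * log((c / n) / ((first[a] / n) * (second[b] / n)))
               for (a, b), c in pairs.items())
```

For a sequence `x` of length $n + 1$, the values `lr_statistic_direct(x)` and `2 * n * plug_in_mutual_information(x)` agree up to floating-point error.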


A.5 Proof of Proposition 3.1

The existence of the entropy rate and the two expressions for $H(Y)$ in (i) can be established in several ways. For example, it is easy to check that the bivariate process $\{(X_n,Y_n)\}$ is asymptotically mean stationary (AMS) [13]. Then the univariate process $\{Y_n\}$ can be viewed as a stationary coding of $\{(X_n,Y_n)\}$, and as such it is also AMS [12]. The results of (i) then follow immediately from the general results in [12].

For (ii), note that we can write,

\begin{align*}
\sum_{i=1}^n H(Y_i \mid X_1^i, Y_1^{i-1})
&= \sum_{i=1}^k H(Y_i \mid X_1^i, Y_1^{i-1}) + \sum_{i=k+1}^n \big[H(X_i,Y_i \mid X_1^{i-1},Y_1^{i-1}) - H(X_i \mid X_1^{i-1},Y_1^{i-1})\big] \\
&= \sum_{i=1}^k H(Y_i \mid X_1^i, Y_1^{i-1}) + \sum_{i=k+1}^n \big[H(X_i,Y_i \mid X_{i-k}^{i-1},Y_{i-k}^{i-1}) - H(X_i \mid X_{i-k}^{i-1},Y_{i-k}^{i-1})\big] \\
&= \sum_{i=1}^k \big[H(Y_i \mid X_1^i, Y_1^{i-1}) - H(Y_i \mid X_{i-k}^i,Y_{i-k}^{i-1})\big] + \sum_{i=1}^n H(Y_i \mid X_{i-k}^i,Y_{i-k}^{i-1}). \tag{23}
\end{align*}

By ergodicity and the continuity of the conditional entropy functional for finite-alphabet distributions, we have that $H(Y_i \mid X_{i-k}^i,Y_{i-k}^{i-1}) \to H(Y_0 \mid X_{-k}^0,Y_{-k}^{-1})$ as $i \to \infty$. Therefore, dividing (23) by $n$ and letting $n \to \infty$, the first term vanishes, and the Cesàro averages in the second term also converge,

$$\frac{1}{n}\sum_{i=1}^n H(Y_i \mid X_{i-k}^i,Y_{i-k}^{i-1}) \to H(Y_0 \mid X_{-k}^0,Y_{-k}^{-1}).$$

The result in (ii) follows from this combined with (i) and the definition of the directed information rate.

Finally if, in addition, the process $\{Y_n\}$ is itself a Markov chain of order (no larger than) $k$, then its entropy rate is simply $H(Y_0 \mid Y_{-k}^{-1})$, and the result in (iii) follows trivially from (ii). □


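A.6 Proof of Theorem 3.3

Both parts of the theorem will be established along the same lines as the proofs of the corresponding results in Theorem 2.1; for that reason, some minor details in the computations below will be omitted.

A.6.1 Proof of (i)

For the Gaussian convergence in part (i), recalling the definitions of the empirical distribution $\hat P_{X_{-k}^0,Y_{-k}^0,n}$ and of the plug-in estimator in (9) and (10), respectively, we first express $\hat I_n^{(k)}(X \to Y)$ as a mutual information,

\begin{align*}
\hat I_n^{(k)}(X \to Y)
&= \sum_{a_0^k,b_0^k} \hat P_{X_{-k}^0,Y_{-k}^0,n}(a_0^k,b_0^k) \log\left(\frac{\hat P_{X_{-k}^0,Y_0|Y_{-k}^{-1},n}(a_0^k,b_k \mid b_0^{k-1})}{\hat P_{X_{-k}^0|Y_{-k}^{-1},n}(a_0^k \mid b_0^{k-1})\,\hat P_{Y_0|Y_{-k}^{-1},n}(b_k \mid b_0^{k-1})}\right) \\
&= \sum_{a_0^k,b_0^k} \hat P_{X_{-k}^0,Y_{-k}^0,n}(a_0^k,b_0^k) \log\left(\frac{P_{X_{-k}^0,Y_0|Y_{-k}^{-1}}(a_0^k,b_k \mid b_0^{k-1})}{P_{Y_0|Y_{-k}^{-1}}(b_k \mid b_0^{k-1})\,P_{X_{-k}^0|Y_{-k}^{-1}}(a_0^k \mid b_0^{k-1})}\right) \tag{24} \\
&\quad + \sum_{a_0^k,b_0^k} \hat P_{X_{-k}^0,Y_{-k}^0,n}(a_0^k,b_0^k) \log\left(\frac{\hat P_{X_{-k}^0,Y_{-k}^0,n}(a_0^k,b_0^k)\,\hat P_{Y_{-k}^{-1},n}(b_0^{k-1})}{\hat P_{X_{-k}^0,Y_{-k}^{-1},n}(a_0^k,b_0^{k-1})\,\hat P_{Y_{-k}^0,n}(b_0^k)} \cdot \frac{P_{X_{-k}^0,Y_{-k}^{-1}}(a_0^k,b_0^{k-1})\,P_{Y_{-k}^0}(b_0^k)}{P_{X_{-k}^0,Y_{-k}^0}(a_0^k,b_0^k)\,P_{Y_{-k}^{-1}}(b_0^{k-1})}\right). \tag{25}
\end{align*}

Substituting the definition of $\hat P_{X_{-k}^0,Y_{-k}^0,n}$ in (24), simplifying, and expanding the logarithm in (25) as a sum of four logarithms, we obtain that $\hat I_n^{(k)}(X \to Y)$ equals,

\begin{align*}
&\frac{1}{n}\sum_{i=1}^n \log\left(\frac{P_{X_{-k}^0,Y_0|Y_{-k}^{-1}}(X_{i-k}^i,Y_i \mid Y_{i-k}^{i-1})}{P_{Y_0|Y_{-k}^{-1}}(Y_i \mid Y_{i-k}^{i-1})\,P_{X_{-k}^0|Y_{-k}^{-1}}(X_{i-k}^i \mid Y_{i-k}^{i-1})}\right) \\
&\quad + D\big(\hat P_{X_{-k}^0,Y_{-k}^0,n}\,\big\|\,P_{X_{-k}^0,Y_{-k}^0}\big) + D\big(\hat P_{Y_{-k}^{-1},n}\,\big\|\,P_{Y_{-k}^{-1}}\big) \\
&\quad - D\big(\hat P_{X_{-k}^0,Y_{-k}^{-1},n}\,\big\|\,P_{X_{-k}^0,Y_{-k}^{-1}}\big) - D\big(\hat P_{Y_{-k}^0,n}\,\big\|\,P_{Y_{-k}^0}\big). \tag{26}
\end{align*}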


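We now claim that each of the four relative entropies above, when multiplied by $\sqrt{n}$, converges to zero a.s. as $n \to \infty$. First recall that, as stated in the beginning of Section 3.2, the chain $\{Z_n = (X_{n-k}^n,Y_{n-k}^n)\}$ on $A^{k+1} \times B^{k+1}$ is ergodic, so we know that it satisfies the ergodic theorem, the central limit theorem, and the law of the iterated logarithm [7]. As we argued in the corresponding steps in the proof of Theorem 2.1, a quadratic Taylor expansion of the logarithm in the definition of $D(\hat P_{Y_{-k}^{-1},n}\,\|\,P_{Y_{-k}^{-1}})$ gives,

\begin{align*}
\sqrt{n}\,D\big(\hat P_{Y_{-k}^{-1},n}\,\big\|\,P_{Y_{-k}^{-1}}\big)
&= \frac{\sqrt{n}}{2}\sum_{b_0^{k-1}} \frac{1}{\hat P_{Y_{-k}^{-1},n}(b_0^{k-1})\,\xi_n(b_0^{k-1})^2}\,\big(\hat P_{Y_{-k}^{-1},n}(b_0^{k-1}) - P_{Y_{-k}^{-1}}(b_0^{k-1})\big)^2 \\
&= \frac{\log\log n}{2\sqrt{n}}\sum_{b_0^{k-1}} \frac{1}{\hat P_{Y_{-k}^{-1},n}(b_0^{k-1})\,\xi_n(b_0^{k-1})^2}\,\big(\hat P_{Y_{-k}^{-1},n}(b_0^{k-1}) - P_{Y_{-k}^{-1}}(b_0^{k-1})\big)^2\,\frac{n}{\log\log n},
\end{align*}

for a (possibly random) $\xi_n(b_0^{k-1})$ between 1 and $P_{Y_{-k}^{-1}}(b_0^{k-1})/\hat P_{Y_{-k}^{-1},n}(b_0^{k-1})$. Now, as in the proof of Theorem 2.1, the first term in the last sum above is bounded a.s. by the ergodic theorem, and the law of the iterated logarithm implies that the product of the next two terms,

$$\big(\hat P_{Y_{-k}^{-1},n}(b_0^{k-1}) - P_{Y_{-k}^{-1}}(b_0^{k-1})\big)^2\,\frac{n}{\log\log n} = \left[\frac{1}{\sqrt{n\log\log n}}\sum_{i=1}^n \Big(1\{Y_{i-k}^{i-1} = b_0^{k-1}\} - P_{Y_{-k}^{-1}}(b_0^{k-1})\Big)\right]^2, \tag{27}$$

is also bounded a.s. So, each summand is a.s. bounded and, therefore, the entire expression tends to zero, a.s., as $n \to \infty$, as claimed.

Exactly the same argument shows that, after being multiplied by $\sqrt{n}$, the other three relative entropies in (26) also tend to zero, a.s., as $n \to \infty$, so that, a.s.,

\begin{align*}
&\sqrt{n}\,\big[\hat I_n^{(k)}(X \to Y) - I(Y_0;X_{-k}^0 \mid Y_{-k}^{-1})\big] \\
&\quad = \frac{1}{\sqrt{n}}\sum_{i=1}^n \left[\log\left(\frac{P_{X_{-k}^0,Y_0|Y_{-k}^{-1}}(X_{i-k}^i,Y_i \mid Y_{i-k}^{i-1})}{P_{Y_0|Y_{-k}^{-1}}(Y_i \mid Y_{i-k}^{i-1})\,P_{X_{-k}^0|Y_{-k}^{-1}}(X_{i-k}^i \mid Y_{i-k}^{i-1})}\right) - I(Y_0;X_{-k}^0 \mid Y_{-k}^{-1})\right] + o(1). \tag{28}
\end{align*}

The Gaussian convergence in part (i) follows by an application of the central limit theorem [7, Sec. 1.16] to the above partial sums of a functional of the chain $\{Z_n\}$. The fact that the variance in (11) exists as the stated limit follows from [7, Theorem 3, p. 97], and the fact that it is finite is a direct consequence of the fact that the alphabets $A$ and $B$ are finite, and that the random variables $\{\log(\cdot)\}$ being summed in (28) are uniformly bounded. □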


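A.6.2 Proof of (ii)

Recall the hypothesis testing setup of Section 3.3. In that notation, the class of all possible transition matrices $Q = Q_\theta$ is parametrized by an $m^k\ell^k(m\ell-1)$-dimensional vector,

$$\theta = \big(\theta_{i_1^k,j_1^k;\,i',j'}\big),$$

where $1 \le i_1,i_2,\ldots,i_k \le m$, $1 \le j_1,j_2,\ldots,j_k \le \ell$ and $(i',j') \in A \times B$, $(i',j') \neq (m,\ell)$; we have used the same notation as before for strings of symbols, $i_1^k \in A^k$ and $j_1^k \in B^k$. In order for each $\theta$ to correspond to a transition probability matrix $Q_\theta$ with positive transitions, we restrict $\theta$ to the following open subset of $\mathbb{R}^{m^k\ell^k(m\ell-1)}$:

$$\Theta = \Big\{\theta \in (0,1)^{m^k\ell^k(m\ell-1)} \,:\, \sum_{(i',j')\neq(m,\ell)} \theta_{i_1^k,j_1^k;\,i',j'} < 1, \text{ for all } i_1^k \in A^k,\ j_1^k \in B^k\Big\},$$

so that the entries of the corresponding $Q_\theta$ are,

$$Q_\theta(i',j' \mid i_1^k,j_1^k) = \begin{cases} \theta_{i_1^k,j_1^k;\,i',j'}, & \text{if } (i',j') \neq (m,\ell), \\ 1 - \sum_{(u,v)\neq(m,\ell)} \theta_{i_1^k,j_1^k;\,u,v}, & \text{if } (i',j') = (m,\ell). \end{cases} \tag{29}$$

The null hypothesis is described by the parameter space $\Phi = \Gamma^x \times \Gamma^y$, where,

\begin{align*}
\Gamma^x &= \Big\{\gamma^x = \big(\gamma^x_{i_1^k,j_1^k;\,i'}\big) \in (0,1)^{m^k\ell^k(m-1)} \,:\, \sum_{1\le i'\le m-1} \gamma^x_{i_1^k,j_1^k;\,i'} < 1, \text{ for all } i_1^k \in A^k,\ j_1^k \in B^k\Big\}, \\
\Gamma^y &= \Big\{\gamma^y = \big(\gamma^y_{j_1^k;\,j'}\big) \in (0,1)^{\ell^k(\ell-1)} \,:\, \sum_{1\le j'\le \ell-1} \gamma^y_{j_1^k;\,j'} < 1, \text{ for all } j_1^k \in B^k\Big\}.
\end{align*}

Clearly $\Phi$ is an open set of $\ell^k[m^k(m-1)+(\ell-1)]$ dimensions, which can be naturally embedded within $\Theta$ via the map $h : \Phi \to \Theta$, where each component of $h(\phi) = h(\gamma^x,\gamma^y)$ is,

$$h_{i_1^k,j_1^k;\,i',j'}(\gamma^x,\gamma^y) = \gamma^x_{i_1^k,j_1^k;\,i'}\;\gamma^y_{j_1^k;\,j'}, \tag{30}$$

for all components with $i' \neq m$ and $j' \neq \ell$, with the obvious extension to the two 'edge' cases $(m,j')$ and $(i',\ell)$.

In order to establish the $\chi^2$-convergence stated in the theorem, recall Proposition 3.5 which states that $2n$ times the plug-in estimator, $2n\hat I_n^{(k)}(X \to Y)$, equals the log-likelihood ratio $\Lambda_n$ defined in (18). The claimed result for the plug-in is an immediate consequence of the corresponding convergence result for $\Lambda_n$ established by Billingsley in [5, Theorem 6.1], where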

= XtA+l {(j)) in the notation of [5]. 

To apply Billingsley's result we only need to verify its main assumption, specifically Condition 6.1 on [5, p. 33], which requires that, throughout $\Phi$, the matrix $Q_{h(\phi)}$ has continuous third order partial derivatives, and that the matrix $\mathcal{L}(\phi)$ defined below has rank $c = \ell^k[m^k(m-1)+(\ell-1)]$.

To that end, we note that, in the notation of [5], the size $s$ of the alphabet is $s = m\ell$, the memory length of the chain is $t = k$, and the full parameter space $H_t = \Theta$ has dimension $r = s^{t+1} - s^t = m^k\ell^k(m\ell-1)$. And since the dimensionality $c$ of the parameter space $\Phi$ for the null hypothesis is $c = \ell^k[m^k(m-1)+(\ell-1)]$, the limiting distribution of $\Lambda_n$ has,

$$r - c = \ell^k(m^{k+1}-1)(\ell-1) \text{ degrees of freedom.}$$
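The dimension count above is simple arithmetic, and it can be checked for a range of alphabet sizes and memory lengths. In this short sketch the function names are hypothetical; it computes $r$ and $c$ as stated and compares $r - c$ with the closed form $\ell^k(m^{k+1}-1)(\ell-1)$.

```python
def parameter_dimensions(m, ell, k):
    """Return (r, c): dimensions of the full model and of the null model."""
    r = (m * ell) ** k * (m * ell - 1)             # s^(t+1) - s^t, s = m*ell, t = k
    c = ell ** k * (m ** k * (m - 1) + (ell - 1))  # free parameters in Q^x and Q^y
    return r, c


def degrees_of_freedom(m, ell, k):
    """Degrees of freedom r - c of the limiting chi-square distribution."""
    r, c = parameter_dimensions(m, ell, k)
    return r - c
```

For binary alphabets and $k = 1$, this gives $r = 12$, $c = 6$ and hence 6 degrees of freedom.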


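By the definitions of $\Phi$ and the map $h$ it is obvious that each component of the matrix $Q_{h(\phi)}$ has third-order partial derivatives with respect to every component of $\phi$. Now consider the $m^{k+1}\ell^{k+1} \times c = m^{k+1}\ell^{k+1} \times \ell^k[m^k(m-1)+(\ell-1)]$ matrix $\mathcal{L}(\phi)$, with entries given by the partial derivatives of each component $Q_{h(\phi)}(i',j' \mid i_1^k,j_1^k)$ of $Q_{h(\phi)}$ with respect to every component of $\phi = (\gamma^x,\gamma^y)$. Then the row of $\mathcal{L}(\phi)$ corresponding to $(i',j',i_1^k,j_1^k)$ consists of all,

$$\frac{\partial Q_{h(\phi)}(i',j' \mid i_1^k,j_1^k)}{\partial \gamma^x_{u_1^k,v_1^k;\,u'}}, \tag{31}$$

followed by all,

$$\frac{\partial Q_{h(\phi)}(i',j' \mid i_1^k,j_1^k)}{\partial \gamma^y_{v_1^k;\,v'}}. \tag{32}$$

In view of (29) and (30), the derivatives in (31) are equal to zero unless $i_1^k = u_1^k$ and $j_1^k = v_1^k$, in which case,

$$\frac{\partial Q_{h(\phi)}(i',j' \mid i_1^k,j_1^k)}{\partial \gamma^x_{u_1^k,v_1^k;\,u'}} = \begin{cases}
\gamma^y_{j_1^k;\,j'}, & \text{if } i' = u',\ i' \neq m \text{ and } j' \neq \ell, \\
-\gamma^y_{j_1^k;\,j'}, & \text{if } u' \neq m,\ i' = m \text{ and } j' \neq \ell, \\
1 - \sum_{v\neq\ell} \gamma^y_{j_1^k;\,v}, & \text{if } i' = u',\ i' \neq m \text{ and } j' = \ell, \\
\sum_{v\neq\ell} \gamma^y_{j_1^k;\,v} - 1, & \text{if } u' \neq m,\ i' = m \text{ and } j' = \ell, \\
0, & \text{in all other cases.}
\end{cases}$$

Similarly, the derivatives in (32) are equal to zero unless $j_1^k = v_1^k$, in which case,

$$\frac{\partial Q_{h(\phi)}(i',j' \mid i_1^k,j_1^k)}{\partial \gamma^y_{v_1^k;\,v'}} = \begin{cases}
\gamma^x_{i_1^k,j_1^k;\,i'}, & \text{if } j' = v',\ i' \neq m \text{ and } j' \neq \ell, \\
1 - \sum_{u\neq m} \gamma^x_{i_1^k,j_1^k;\,u}, & \text{if } j' = v',\ i' = m \text{ and } j' \neq \ell, \\
-\gamma^x_{i_1^k,j_1^k;\,i'}, & \text{if } v' \neq \ell,\ i' \neq m \text{ and } j' = \ell, \\
\sum_{u\neq m} \gamma^x_{i_1^k,j_1^k;\,u} - 1, & \text{if } v' \neq \ell,\ i' = m \text{ and } j' = \ell, \\
0, & \text{in all other cases.}
\end{cases}$$

Since all the components of all the parameters $\gamma^x$ and $\gamma^y$ are strictly positive, a careful (and tedious) examination of the above expressions reveals that the matrix $\mathcal{L}(\phi)$ has rank $c = \ell^k[m^k(m-1)+(\ell-1)]$ throughout $\Phi$. Therefore, Condition 6.1 of [5] holds as claimed, thus completing the proof. □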


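A.7 Proof of Corollary 3.4


In view of Proposition 3.1, clearly it suffices to prove the first two assertions (13) and (14) of the corollary. We begin by noting that, as long as the pair process $\{(X_n,Y_n)\}$ is an irreducible and aperiodic Markov chain of order $k$, then it is a standard exercise to show that the chain $\{Z_n = (X_{n-k}^n,Y_{n-k}^n)\}$ is ergodic. Then, an examination of the proof of Theorem 3.3 part (i) reveals that the entire argument remains valid for both cases $I(Y_0;X_{-k}^0 \mid Y_{-k}^{-1}) > 0$ and $I(Y_0;X_{-k}^0 \mid Y_{-k}^{-1}) = 0$, and without the positivity assumption on the transition matrix $Q$.

For the a.s.-rate in (13), we observe that the computation leading to (27) implies that, as $n \to \infty$,

$$\sqrt{\frac{n}{\log\log n}}\;D\big(\hat P_{Y_{-k}^{-1},n}\,\big\|\,P_{Y_{-k}^{-1}}\big) = O\left(\sqrt{\frac{\log\log n}{n}}\right) = o(1), \quad \text{a.s.},$$

and that the same bound holds for each of the four relative entropies in (26). Therefore, (25) becomes,

\begin{align*}
&\sqrt{\frac{n}{\log\log n}}\,\big[\hat I_n^{(k)}(X \to Y) - I(Y_0;X_{-k}^0 \mid Y_{-k}^{-1})\big] \\
&\quad = \frac{1}{\sqrt{n\log\log n}}\sum_{i=1}^n \left[\log\left(\frac{P_{X_{-k}^0,Y_0|Y_{-k}^{-1}}(X_{i-k}^i,Y_i \mid Y_{i-k}^{i-1})}{P_{Y_0|Y_{-k}^{-1}}(Y_i \mid Y_{i-k}^{i-1})\,P_{X_{-k}^0|Y_{-k}^{-1}}(X_{i-k}^i \mid Y_{i-k}^{i-1})}\right) - I(Y_0;X_{-k}^0 \mid Y_{-k}^{-1})\right] + o(1),
\end{align*}

a.s. as $n \to \infty$. Then, if $\sigma^2 > 0$, the law of the iterated logarithm [7] implies (13) as claimed, and if $\sigma^2 = 0$, the same conclusion follows by [26, Theorem 17.5.4].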

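For the $L^1$ rate in (14), we first recall the expression for the plug-in estimator in (26) and claim that each of the four relative entropies there converges to zero at a rate $O(1/n)$ in $L^1$. To see this, consider $D(\hat P_{Y_{-k}^{-1},n}\,\|\,P_{Y_{-k}^{-1}})$; a first-order Taylor expansion for the logarithm in its definition gives,

$$D\big(\hat P_{Y_{-k}^{-1},n}\,\big\|\,P_{Y_{-k}^{-1}}\big) = \sum_{b_0^{k-1}} \hat P_{Y_{-k}^{-1},n}(b_0^{k-1})\,\frac{1}{\zeta_n(b_0^{k-1})}\left(\frac{\hat P_{Y_{-k}^{-1},n}(b_0^{k-1})}{P_{Y_{-k}^{-1}}(b_0^{k-1})} - 1\right),$$

where, for those $b_0^{k-1}$ for which $\hat P_{Y_{-k}^{-1},n}(b_0^{k-1})$ is nonzero, $\zeta_n(b_0^{k-1})$ is a (possibly random) constant between 1 and $\hat P_{Y_{-k}^{-1},n}(b_0^{k-1})/P_{Y_{-k}^{-1}}(b_0^{k-1})$, while for the remaining $b_0^{k-1}$, $\zeta_n(b_0^{k-1})$ is given an arbitrary (finite) value. Writing $S_n(b_0^{k-1})$ for the difference $\hat P_{Y_{-k}^{-1},n}(b_0^{k-1}) - P_{Y_{-k}^{-1}}(b_0^{k-1})$, and $\rho_n(b_0^{k-1}) = [\zeta_n(b_0^{k-1}) - 1]\,P_{Y_{-k}^{-1}}(b_0^{k-1})/S_n(b_0^{k-1})$, after some simple algebra we obtain,

$$D\big(\hat P_{Y_{-k}^{-1},n}\,\big\|\,P_{Y_{-k}^{-1}}\big) = \sum_{b_0^{k-1}} S_n(b_0^{k-1})\,\frac{1 + S_n(b_0^{k-1})/P_{Y_{-k}^{-1}}(b_0^{k-1})}{1 + \rho_n(b_0^{k-1})\,S_n(b_0^{k-1})/P_{Y_{-k}^{-1}}(b_0^{k-1})},$$

where now each $\rho_n(b_0^{k-1}) \in [0,1]$. Using the simple inequality $(1+x)/(1+\rho x) \le 1 + x(1-\rho)$, which holds for all $x > -1$, for each $\rho \in [0,1]$, gives,

$$D\big(\hat P_{Y_{-k}^{-1},n}\,\big\|\,P_{Y_{-k}^{-1}}\big) \le \sum_{b_0^{k-1}} S_n(b_0^{k-1})\left(1 + \big[1 - \rho_n(b_0^{k-1})\big]\frac{S_n(b_0^{k-1})}{P_{Y_{-k}^{-1}}(b_0^{k-1})}\right) \le \sum_{b_0^{k-1}} \frac{S_n(b_0^{k-1})^2}{P_{Y_{-k}^{-1}}(b_0^{k-1})},$$

because each $\rho_n$ is between 0 and 1, and each $S_n$ sums to zero by definition. In order to show that this relative entropy converges to zero in $L^1$ at a rate $O(1/n)$, it suffices to show that each term in the last sum does. And for that it suffices to show that for each $b_0^{k-1}$,

$$n\,E\big[S_n(b_0^{k-1})^2\big] = O(1), \quad \text{as } n \to \infty. \tag{33}$$

But, as in (27), $S_n(b_0^{k-1})$ is simply the centered partial sums of a functional of the chain $\{Z_n\}$,

$$S_n(b_0^{k-1}) = \frac{1}{n}\sum_{i=1}^n \Big[1\{Y_{i-k}^{i-1} = b_0^{k-1}\} - P_{Y_{-k}^{-1}}(b_0^{k-1})\Big],$$

so the result of [7, Theorem 3, p. 97] tells us that actually $n\,E[S_n(b_0^{k-1})^2]$ converges to a finite constant as $n \to \infty$. This implies (33), and establishes that,

$$E\Big[D\big(\hat P_{Y_{-k}^{-1},n}\,\big\|\,P_{Y_{-k}^{-1}}\big)\Big] = O\left(\frac{1}{n}\right), \quad \text{as } n \to \infty.$$

The exact same argument shows that the same result also holds for the other three relative entropies in (26). Therefore, in order to establish the required result in (14) it suffices to show that,

$$E\left|\frac{1}{n}\sum_{i=1}^n \log\left(\frac{P_{X_{-k}^0,Y_0|Y_{-k}^{-1}}(X_{i-k}^i,Y_i \mid Y_{i-k}^{i-1})}{P_{Y_0|Y_{-k}^{-1}}(Y_i \mid Y_{i-k}^{i-1})\,P_{X_{-k}^0|Y_{-k}^{-1}}(X_{i-k}^i \mid Y_{i-k}^{i-1})}\right) - I(Y_0;X_{-k}^0 \mid Y_{-k}^{-1})\right| = O\left(\frac{1}{\sqrt{n}}\right). \tag{34}$$

But, once again, this can be seen as the $L^1$ norm of the centered partial sums of a functional of the chain $\{Z_n\}$. Therefore, Hölder's inequality combined with the result of [7, Theorem 3, p. 97], imply exactly (34), establishing (14) and completing the proof. □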
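The proof above bounds the relative entropy between the empirical and the true distribution by the sum of $S_n^2/P$, which is the classical chi-square divergence. That inequality holds for any pair of distributions with strictly positive reference probabilities, and it can be spot-checked numerically. The sketch below uses hypothetical helper names and randomly generated distributions; it is a sanity check, not part of the argument.

```python
import random
from math import log


def relative_entropy(p, q):
    """D(p || q) in nats; q must be strictly positive where p is."""
    return sum(pi * log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def chi_square_bound(p, q):
    """Sum over symbols of (p - q)^2 / q, with q strictly positive."""
    return sum((pi - qi) ** 2 / qi for pi, qi in zip(p, q))


def random_distribution(rng, size):
    """A random probability vector with strictly positive entries."""
    weights = [rng.random() + 1e-3 for _ in range(size)]
    total = sum(weights)
    return [w / total for w in weights]
```

With `p = [0.5, 0.3, 0.2]` and `q = [0.2, 0.5, 0.3]`, the relative entropy is about 0.224 and the bound is about 0.563.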


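A.8 Proof of Proposition 3.5

We proceed along the same lines as in the proof of Proposition 2.4. Recalling (17), the first maximum in the definition of $\Lambda_n$ is,

$$\max_{\theta\in\Theta} L_n(X_{-k+1}^n,Y_{-k+1}^n;\theta) = \max_{Q} \sum_{i=1}^n \log\big(Q(X_i,Y_i \mid X_{i-k}^{i-1},Y_{i-k}^{i-1})\big),$$

where the last maximization is over all transition matrices $Q$ with all positive entries, so that,

\begin{align*}
\max_{\theta\in\Theta} L_n(X_{-k+1}^n,Y_{-k+1}^n;\theta)
&= \max_{Q} \sum_{a_0^k,b_0^k} n\,\hat P_{X_{-k}^0,Y_{-k}^0,n}(a_0^k,b_0^k) \log\big(Q(a_k,b_k \mid a_0^{k-1},b_0^{k-1})\big) \\
&= -n \min_{Q} \Bigg\{ D\big(\hat P_{X_{-k}^0,Y_{-k}^0,n}\,\big\|\,Q \otimes \hat P_{X_{-k}^{-1},Y_{-k}^{-1},n}\big) \\
&\qquad\qquad - \sum_{a_0^k,b_0^k} \hat P_{X_{-k}^0,Y_{-k}^0,n}(a_0^k,b_0^k) \log\left(\frac{\hat P_{X_{-k}^0,Y_{-k}^0,n}(a_0^k,b_0^k)}{\hat P_{X_{-k}^{-1},Y_{-k}^{-1},n}(a_0^{k-1},b_0^{k-1})}\right) \Bigg\},
\end{align*}

where the distribution,

$$\big(Q \otimes \hat P_{X_{-k}^{-1},Y_{-k}^{-1},n}\big)(a_0^k,b_0^k) = \hat P_{X_{-k}^{-1},Y_{-k}^{-1},n}(a_0^{k-1},b_0^{k-1})\,Q(a_k,b_k \mid a_0^{k-1},b_0^{k-1}),$$

for all $a_0^k \in A^{k+1}$, $b_0^k \in B^{k+1}$. The above minimum is obviously achieved by making the relative entropy equal to zero, that is, by taking,

$$Q(a_k,b_k \mid a_0^{k-1},b_0^{k-1}) = \hat P_{X_0,Y_0|X_{-k}^{-1},Y_{-k}^{-1},n}(a_k,b_k \mid a_0^{k-1},b_0^{k-1}),$$

so that,

$$\max_{\theta\in\Theta} L_n(X_{-k+1}^n,Y_{-k+1}^n;\theta) = n\Big[H\big(\hat X_{-k}^{-1},\hat Y_{-k}^{-1}\big) - H\big(\hat X_{-k}^0,\hat Y_{-k}^0\big)\Big], \tag{35}$$

where $(\hat X_{-k}^0,\hat Y_{-k}^0) \sim \hat P_{X_{-k}^0,Y_{-k}^0,n}$.

The computation for the second maximum in (18) is a little more involved, as it reduces to two different maximizations. But because both of these are very similar to the one just computed, we will give an outline of the steps involved without providing all the details. Recall the log-likelihood expression in (17) and that, under the null, $Q$ admits the decomposition in (16). We have:

\begin{align*}
\max_{\phi\in\Phi} L_n(X_{-k+1}^n,Y_{-k+1}^n;h(\phi))
&= \max_{Q^x,Q^y} \sum_{i=1}^n \log\Big(Q^x(X_i \mid X_{i-k}^{i-1},Y_{i-k}^{i-1})\,Q^y(Y_i \mid Y_{i-k}^{i-1})\Big) \\
&= \max_{Q^x} \sum_{i=1}^n \log\big(Q^x(X_i \mid X_{i-k}^{i-1},Y_{i-k}^{i-1})\big) + \max_{Q^y} \sum_{i=1}^n \log\big(Q^y(Y_i \mid Y_{i-k}^{i-1})\big) \\
&= \max_{Q^x} \sum_{a_0^k,b_0^{k-1}} n\,\hat P_{X_{-k}^0,Y_{-k}^{-1},n}(a_0^k,b_0^{k-1}) \log\big(Q^x(a_k \mid a_0^{k-1},b_0^{k-1})\big) \\
&\qquad + \max_{Q^y} \sum_{b_0^k} n\,\hat P_{Y_{-k}^0,n}(b_0^k) \log\big(Q^y(b_k \mid b_0^{k-1})\big) \\
&= n \sum_{a_0^k,b_0^{k-1}} \hat P_{X_{-k}^0,Y_{-k}^{-1},n}(a_0^k,b_0^{k-1}) \log \hat P_{X_0|X_{-k}^{-1},Y_{-k}^{-1},n}(a_k \mid a_0^{k-1},b_0^{k-1}) \\
&\qquad + n \sum_{b_0^k} \hat P_{Y_{-k}^0,n}(b_0^k) \log \hat P_{Y_0|Y_{-k}^{-1},n}(b_k \mid b_0^{k-1}) \\
&= n\Big[-H\big(\hat X_{-k}^0,\hat Y_{-k}^{-1}\big) + H\big(\hat X_{-k}^{-1},\hat Y_{-k}^{-1}\big) - H\big(\hat Y_{-k}^0\big) + H\big(\hat Y_{-k}^{-1}\big)\Big]. \tag{36}
\end{align*}

Combining (35) and (36) and using the chain rule,

\begin{align*}
\Lambda_n
&= 2\left\{\max_{\theta\in\Theta} L_n(X_{-k+1}^n,Y_{-k+1}^n;\theta) - \max_{\phi\in\Phi} L_n(X_{-k+1}^n,Y_{-k+1}^n;h(\phi))\right\} \\
&= 2n\Big[H\big(\hat Y_{-k}^0\big) - H\big(\hat Y_{-k}^{-1}\big) - H\big(\hat X_{-k}^0,\hat Y_{-k}^0\big) + H\big(\hat X_{-k}^0,\hat Y_{-k}^{-1}\big)\Big] \\
&= 2n\Big[H\big(\hat Y_0 \mid \hat Y_{-k}^{-1}\big) - H\big(\hat Y_0 \mid \hat X_{-k}^0,\hat Y_{-k}^{-1}\big)\Big] \\
&= 2n\,I\big(\hat Y_0;\hat X_{-k}^0 \mid \hat Y_{-k}^{-1}\big),
\end{align*}

which, recalling the definition of $\hat I_n^{(k)}(X \to Y)$, is precisely the claimed result.

□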
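The two maximizations in this proof can be carried out explicitly on data, which gives a direct numerical check of the identity. The sketch below uses hypothetical function names and treats the first $k$ sample pairs as the initial context. It evaluates both maximized log-likelihoods at their empirical maximizers and compares twice their difference with $2n$ times the entropy form of the plug-in estimator.

```python
from collections import Counter
from math import log


def lr_statistic_direct(xs, ys, k):
    """Twice (max log-lik over all pair chains minus max log-lik under (16))."""
    joint, ctx = Counter(), Counter()      # counts for Q(a_k, b_k | both pasts)
    x_next = Counter()                     # counts for Q^x(a_k | both pasts)
    y_next, y_ctx = Counter(), Counter()   # counts for Q^y(b_k | past of Y)
    for i in range(k, len(xs)):
        xp, yp = tuple(xs[i - k:i]), tuple(ys[i - k:i])
        joint[(xp, yp, xs[i], ys[i])] += 1
        ctx[(xp, yp)] += 1
        x_next[(xp, yp, xs[i])] += 1
        y_next[(yp, ys[i])] += 1
        y_ctx[yp] += 1
    full = sum(c * log(c / ctx[key[:2]]) for key, c in joint.items())
    null = (sum(c * log(c / ctx[key[:2]]) for key, c in x_next.items())
            + sum(c * log(c / y_ctx[key[0]]) for key, c in y_next.items()))
    return 2 * (full - null)


def plug_in_estimate(xs, ys, k):
    """H(Y block) - H(Y past) - H(X block, Y block) + H(X block, Y past)."""
    n = len(xs) - k

    def entropy(counter):
        return -sum((c / n) * log(c / n) for c in counter.values())

    c_xy, c_y, c_x_yp, c_yp = Counter(), Counter(), Counter(), Counter()
    for i in range(k, len(xs)):
        xb, yb = tuple(xs[i - k:i + 1]), tuple(ys[i - k:i + 1])
        c_xy[(xb, yb)] += 1
        c_y[yb] += 1
        c_x_yp[(xb, yb[:-1])] += 1
        c_yp[yb[:-1]] += 1
    return entropy(c_y) - entropy(c_yp) - entropy(c_xy) + entropy(c_x_yp)
```

On any pair of equal-length sequences, `lr_statistic_direct(xs, ys, k)` and `2 * (len(xs) - k) * plug_in_estimate(xs, ys, k)` agree up to rounding.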